Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

polkadot: pin one block per session #1220

Merged
merged 10 commits into from
Sep 7, 2023
Merged

Conversation

ordian
Copy link
Member

@ordian ordian commented Aug 28, 2023

Fixes #623.

Problem

When an invalid parachain block (candidate) gets backed and included, an honest validator doing approval-checking work will raise a dispute. Assuming the honest supermajority, the dispute will conclude against the candidate. That means we will revert to the block right before the inclusion of the invalid candidate and transplant the results of the dispute to the next block built on top (without the inclusion). So far so good.

After this happens, we want to slash the offending backers. Normally, this would happen right in the same block when the dispute concluded. However, if the inclusion happened in a past session, this might happen in the following blocks. For past session slashing to work, we need the state of a block in this past session in order to query the key ownership proof on the node side.
The problem is that inclusion blocks of the candidate might be pruned as a fork of a finalized chain as you can see from the logs:

2023-09-04 15:34:43.050  INFO tokio-runtime-worker parachain::dispute-coordinator: Dispute on candidate concluded with 'invalid' result candidate_hash=0x8e0b19f6d6ef2180b7de85377c5b1ea49bfe06a2667118621bb6b4125799cc4e session=1 traceID=188808017
2023-09-04 15:34:43.777 DEBUG tokio-runtime-worker parachain::approval-voting: approved blocks 41-[11 111111111]-31
2023-09-04 15:34:44.934  INFO tokio-runtime-worker substrate: 💤 Idle (5 peers), best: #41 (0x582e…7bbc), finalized #30 (0xdd25…2f3b), ⬇ 5.6kiB/s ⬆ 5.0kiB/s
2023-09-04 15:34:45.780 DEBUG tokio-runtime-worker parachain::approval-voting: Block finalized block_hash=0xb981441ffd53e96053e5126d63df0d23832fa09811dc15be8bcb4a88df2eab49 block_number=38
2023-09-04 15:34:48.041  INFO tokio-runtime-worker substrate: ✨ Imported #42 (0x846e…bf3a)
2023-09-04 15:34:48.045  INFO tokio-runtime-worker parachain::dispute-coordinator: Processing unapplied validator slashes session_index=1 candidate_hash=0x8e0b19f6d6ef2180b7de85377c5b1ea49bfe06a2667118621bb6b4125799cc4e n_slashes=1 traceID=18880
8017283788890442224175552623484580
2023-09-04 15:34:48.045  WARN tokio-runtime-worker parachain::runtime-api: cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0x2f095f2550a11c0889c04dad89b22419f72c297960cd9ba1891da0b9b198794e
2023-09-04 15:34:48.045 DEBUG tokio-runtime-worker parachain::dispute-coordinator: Key ownership proof not yet supported. session_index=1 candidate_hash=0x8e0b19f6d6ef2180b7de85377c5b1ea49bfe06a2667118621bb6b4125799cc4e validator_id=Public(bc646
2023-09-04 15:34:48.045  WARN tokio-runtime-worker parachain::dispute-coordinator: Could not generate key ownership proofs for 1 keys session_index=1 candidate_hash=0x8e0b19f6d6ef2180b7de85377c5b1ea49bfe06a2667118621bb6b4125799cc4e traceID=18880
8017283788890442224175552623484580
2023-09-04 15:34:54.034  INFO tokio-runtime-worker parachain::dispute-coordinator: Couldn't find inclusion parent for an unapplied slash

Solution

To address this problem, we pin the state of 1 block per session for DISPUTE_WINDOW (which is 6 at the time of writing) sessions. See 5cba653. Given that we know the blocks that are not pruned, we can make guarantee that runtime APIs to that block will not fail.

Alternatives

  1. Instead of storing (pinning) the whole state of the block, why don't we collect just key ownership proofs of backers?
    Well, the problem with this approach, is that it assumes that the node side can predict who is going to be slashed in the runtime before it happens. The logic of slashing is encapsulated in the runtime and we don't want to leak that to the node side. Besides, there might be other storage items we might want to query in the future.

  2. Pin inclusion blocks as soon as we see f+1 votes against a candidate. The problem with this approach is described below: polkadot: pin one block per session #1220 (comment)

@ordian ordian added R0-silent Changes should not be mentioned in any release notes T0-node This PR/Issue is related to the topic “node”. T8-parachains_engineering labels Aug 28, 2023
sandreim
sandreim previously approved these changes Aug 28, 2023
polkadot/node/core/chain-api/src/lib.rs Outdated Show resolved Hide resolved
polkadot/node/core/chain-api/src/metrics.rs Outdated Show resolved Hide resolved
eskimor
eskimor previously approved these changes Aug 28, 2023
Copy link
Member

@eskimor eskimor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @ordian !

polkadot/node/core/chain-api/src/lib.rs Outdated Show resolved Hide resolved
"All good. Unpinning the inclusion blocks",
);
for (_number, hash) in inclusions {
ctx.send_message(ChainApiMessage::UnpinBlock(hash)).await;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a mall issue. While we pin the block the scraper will eventually fade out blocks that fell behind finality too much.

This might be hard to exploit, as slashing should not take too long (always true?!), but we also have an lru that might just get full. I am wondering whether we can not do better? It looks like we would not strictly need the precise relay parent to be around. We should be able to make a block of the same session suffice.

With this, couldn't we make this more robust and more efficient at the same time, by making sure to always have one relay block available for every session in dispute window size and use this for any proofs/session fetches/..?

Technically with probabilistic finality fixed, this should not be too hard as that relay block could then always be on the canonical chain.

Interested in your thoughts. I am definitely not opposing the current solution, it should definitely get a security audit though. I feel there could be exploitable edge cases.

I definitely love the explicit block pinning!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great suggestion, thank you! Currently tinkering with an implementation for it to see if there are any problems with it.

@ordian ordian marked this pull request as draft August 30, 2023 11:56
@ordian ordian force-pushed the ao-polkadot-use-pinning-api branch from a3f6021 to 5cba653 Compare August 31, 2023 10:05
@ordian ordian changed the base branch from ao-expose-client-pinning-api to master August 31, 2023 10:05
@ordian ordian changed the title polkadot: pin inclusion blocks until slashed polkadot: pin one block per session Aug 31, 2023
ordian added 2 commits August 31, 2023 16:00
* master:
  Sassafras primitives (#1249)
  Restructure `dispatch` macro related exports (#1162)
  backing: move the min votes threshold to the runtime (#1200)
  Bump zstd from 0.11.2+zstd.1.5.2 to 0.12.4 (#1326)
  Remove `substrate_test_utils::test` (#1321)
  remove disable-runtime-api (#1328)
@ordian ordian dismissed stale reviews from eskimor and sandreim August 31, 2023 14:02

stale

@ordian ordian requested review from eskimor and sandreim August 31, 2023 14:02
* master: (25 commits)
  Markdown linter (#1309)
  Update `fmt` file and some authors (#1379)
  Bump the known_good_semver group with 1 update (#1375)
  Bump proc-macro-warning from 0.4.1 to 0.4.2 (#1376)
  feat: add futures api to `TransactionPool` (#1348)
  Ensure cumulus/bridges is ignored by formatter and run it (#1369)
  substrate: chain-spec paths corrected in zombienet tests (#1362)
  contracts: Update to wasmi 0.31 (#1350)
  [improve docs]: Template pallet (#1280)
  [xcm-emulator] Unignore cumulus integration tests (#1247)
  Fix wrong ref counting (#1358)
  Use cached session index to obtain executor params (#1190)
  fix typos (#1339)
  Use bandersnatch-vrfs with locked dependencies ref (#1342)
  Bump bs58 from 0.4.0 to 0.5.0 (#1293)
  Contracts: `seal0::balance` should return the free balance (#1254)
  Logs: add extra debug log for negative rep changes (#1205)
  Added short-benchmarks for cumulus (#1183)
  [xcm-emulator] Improve hygiene and clean up (#1301)
  Bump the known_good_semver group with 1 update (#1347)
  ...
@ordian ordian marked this pull request as ready for review September 4, 2023 12:01
@ordian
Copy link
Member Author

ordian commented Sep 4, 2023

This is ready for the review, but I can't seem to repro the error observed without this changes 🙈
Also
need to fix the test in CI (again, can't repro locally).

}

#[test]
fn forward_subsystem_works() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to remove this test to break (dev-) cycle dependency. Also, in general, I am not convinced we need tests for tests.

Copy link
Member

@eskimor eskimor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! This is code how I like it!

status: LeafStatus::Fresh,
span: Arc::new(jaeger::Span::Disabled),
},
fresh_leaf(*new_head, number),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at the side: We found this fresh/stale mechanism to be redundant. We should remove it at some point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed it to new_leaf to account for this issue being worked on. Also with this helper function, we won't have to modify tons of files to implement this issue.

pub number: BlockNumber,
/// A handle to unpin the block on drop.
pub unpin_handle: UnpinHandle,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 😍

///
/// This is useful for runtime API calls to blocks that are
/// racing against finality, e.g. for slashing purposes.
pub type UnpinHandle = sc_client_api::UnpinHandle<Block>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! No races against finality anymore! This should do away with a couple of issues, you have been looking into @tdimitrov !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All we need to do is keep the leaf alive, while we work on it.

@@ -74,6 +75,10 @@ pub struct RuntimeInfo {
/// Look up cached sessions by `SessionIndex`.
session_info_cache: LruMap<SessionIndex, ExtendedSessionInfo>,

/// Unpin handle of *some* block in the session.
/// Only blocks pinned explicitly by `pin_block` are stored here.
pinned_blocks: LruMap<SessionIndex, UnpinHandle>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Contributor

@sandreim sandreim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @ordian . This is a very elegant solution.

* master: (28 commits)
  Adds base benchmark for do_tick in broker pallet (#1235)
  zombienet: use another collator image for the slashing test (#1386)
  Prevent a fail prdoc check to block (#1433)
  Fix nothing scheduled on session boundary (#1403)
  GHW for building and publishing docker images (#1391)
  pallet asset-conversion additional quote tests (#1371)
  Remove deprecated `pallet_balances`'s `set_balance_deprecated` and `transfer` dispatchables (#1226)
  Fix PRdoc check (#1419)
  Fix the wasm runtime substitute caching bug (#1416)
  Bump enumn from 0.1.11 to 0.1.12 (#1412)
  RFC 14: Improve locking mechanism for parachains (#1290)
  Add PRdoc check (#1408)
  fmt fixes (#1413)
  Enforce a decoding limit in MultiAssets (#1395)
  Remove dynamic dispatch using `Ext` (#1399)
  Remove redundant calls to `borrow()` (#1393)
  Get rid of polling in `WarpSync` (#1265)
  Bump actions/checkout from 3 to 4 (#1398)
  Bump thiserror from 1.0.47 to 1.0.48 (#1396)
  Move Relay-Specific Shared Code to One Place (#1193)
  ...
@ordian ordian enabled auto-merge (squash) September 7, 2023 08:33
* master:
  Forgotten `polkadot-core-primitives/std` (#1440)
@ordian ordian merged commit 1550388 into master Sep 7, 2023
@ordian ordian deleted the ao-polkadot-use-pinning-api branch September 7, 2023 10:24
Ank4n pushed a commit that referenced this pull request Sep 8, 2023
* polkadot: propagate UnpinHandle to ActiveLeafUpdate

Also extract the leaf creation for tests
into a common function.

* dispute-coordinator: try pinned blocks for slashin

* apparently 1.72 is smarter than 1.70

* address nits

* rename fresh_leaf to new_leaf
Daanvdplas pushed a commit that referenced this pull request Sep 11, 2023
* polkadot: propagate UnpinHandle to ActiveLeafUpdate

Also extract the leaf creation for tests
into a common function.

* dispute-coordinator: try pinned blocks for slashin

* apparently 1.72 is smarter than 1.70

* address nits

* rename fresh_leaf to new_leaf
@Polkadot-Forum
Copy link

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/stalled-parachains-on-kusama-post-mortem/3998/1

bkchr pushed a commit that referenced this pull request Apr 10, 2024
* refactor finality relay helper definitions

* add missing doc

* removed commented code

* fmt

* disable rustfmt for macro

* move best_finalized method const to relay chain def
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
R0-silent Changes should not be mentioned in any release notes T0-node This PR/Issue is related to the topic “node”.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Figure out raciness of past session slashing
4 participants