-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Merged by Bors] - fix deneb sync bug #4869
Conversation
1. Add commitments to logs and update the `Display` implementation of `KzgCommitment` to become truncated similarly to block root. I've been finding it difficult to debug scenarios involving multiple blobs for the same `(index, block_root)`. Logging the commitment will help with this, we can match it to what exists in the block. Example output: ``` Oct 20 21:13:36.700 DEBG Successfully verified gossip blob commitment: 0xa3c1…1cd8, index: 0, root: 0xf31e…f9de, slot: 154568 Oct 20 21:13:36.785 DEBG Successfully verified gossip block commitments: [0xa3c1…1cd8, 0x8655…02ff, 0x8d6a…955a, 0x84ac…3a1b, 0x9752…629b, 0xb9fc…20fb], root: 0xf31eeb732702e429e89057b15e1c0c631e8452e09e03cb1924353f536ef4f9de, slot: 154568, graffiti: teku/besu, service: beacon ``` Example output in a block with no blobs (this will show up pre-deneb): ``` 426734:Oct 20 21:15:24.113 DEBG Successfully verified gossip block, commitments: [], root: 0x619db1360ba0e8d44ae2a0f2450ebca47e167191feecffcfac0e8d7b6c39623c, slot: 154577, graffiti: teku/nethermind, service: beacon, module: beacon_chain::beacon_chain:2765 ``` 2. Remove `strum::IntoStaticStr` from `AvailabilityCheckError`. This is because `IntoStaticStr` end up dropping information inside the enum. So kzg commitments in this error are dropped, making it more difficult to debug ``` AvailabilityCheckError::KzgCommitmentMismatch { blob_commitment: KzgCommitment, block_commitment: KzgCommitment, }, ``` which is output as just `AvailabilityCheckError` 3. Some additional misc sync logs I found useful in debugging #4869 4. This downgrades ”Block returned for single block lookup not present” to debug because I don’t think we can fix the scenario that causes this unless we can cancel inflight rpc requests Co-authored-by: realbigsean <seananderson33@gmail.com>
1. Add commitments to logs and update the `Display` implementation of `KzgCommitment` to become truncated similarly to block root. I've been finding it difficult to debug scenarios involving multiple blobs for the same `(index, block_root)`. Logging the commitment will help with this, we can match it to what exists in the block. Example output: ``` Oct 20 21:13:36.700 DEBG Successfully verified gossip blob commitment: 0xa3c1…1cd8, index: 0, root: 0xf31e…f9de, slot: 154568 Oct 20 21:13:36.785 DEBG Successfully verified gossip block commitments: [0xa3c1…1cd8, 0x8655…02ff, 0x8d6a…955a, 0x84ac…3a1b, 0x9752…629b, 0xb9fc…20fb], root: 0xf31eeb732702e429e89057b15e1c0c631e8452e09e03cb1924353f536ef4f9de, slot: 154568, graffiti: teku/besu, service: beacon ``` Example output in a block with no blobs (this will show up pre-deneb): ``` 426734:Oct 20 21:15:24.113 DEBG Successfully verified gossip block, commitments: [], root: 0x619db1360ba0e8d44ae2a0f2450ebca47e167191feecffcfac0e8d7b6c39623c, slot: 154577, graffiti: teku/nethermind, service: beacon, module: beacon_chain::beacon_chain:2765 ``` 2. Remove `strum::IntoStaticStr` from `AvailabilityCheckError`. This is because `IntoStaticStr` end up dropping information inside the enum. So kzg commitments in this error are dropped, making it more difficult to debug ``` AvailabilityCheckError::KzgCommitmentMismatch { blob_commitment: KzgCommitment, block_commitment: KzgCommitment, }, ``` which is output as just `AvailabilityCheckError` 3. Some additional misc sync logs I found useful in debugging #4869 4. This downgrades ”Block returned for single block lookup not present” to debug because I don’t think we can fix the scenario that causes this unless we can cancel inflight rpc requests Co-authored-by: realbigsean <seananderson33@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Not that I understand sync all that well but this seems to make sense. Lighthouse was continuously stalling yesterday when syncing devnet-10
and then pawan pointed me to this PR. Unfortunately devnet-10
was shut down by the time I tried this PR this morning but lighthouse does seem to sync fine on devnet-11
. This devnet is much newer though so we might have to wait a bit before we can hit this bug again on devnet-11
and retry..
bors r+ |
## Issue Addressed I observed our forward sync on devnet 9 would stall when we would hit this log: ``` 250425:Oct 19 00:54:17.133 WARN Blocks and blobs request for range received invalid data, error: KzgCommitmentMismatch, batch_id: 4338, peer_id: 16Uiu2HAmHbmkEQFDrJfNuy1aYyAfHkNUwSD9FN7EVAqGJ8YTF9Mh, service: sync, module: network::sync::manager:1036 ``` ## Proposed Changes `range_sync_block_and_blob_response` [here](https://github.com/sigp/lighthouse/blob/1cb02a13a53d0e603ad5920c03832e5779c3df61/beacon_node/network/src/sync/manager.rs#L1013) removes the request from the sync manager. later, however if there's an error, `inject_error` [here](https://github.com/sigp/lighthouse/blob/1cb02a13a53d0e603ad5920c03832e5779c3df61/beacon_node/network/src/sync/manager.rs#L1055) expects the request to exist so we can handle retry logic. So this PR just re-inserts the request (withthout any accumulated blobs or blocks) when we hit an error here. The issue is unique to block+blob sync because the error here is only possible from mismatches between blocks + blobs after we've downloaded both, there's no equivalent error in block sync Co-authored-by: realbigsean <seananderson33@gmail.com>
Build failed: |
bors retry |
## Issue Addressed I observed our forward sync on devnet 9 would stall when we would hit this log: ``` 250425:Oct 19 00:54:17.133 WARN Blocks and blobs request for range received invalid data, error: KzgCommitmentMismatch, batch_id: 4338, peer_id: 16Uiu2HAmHbmkEQFDrJfNuy1aYyAfHkNUwSD9FN7EVAqGJ8YTF9Mh, service: sync, module: network::sync::manager:1036 ``` ## Proposed Changes `range_sync_block_and_blob_response` [here](https://github.com/sigp/lighthouse/blob/1cb02a13a53d0e603ad5920c03832e5779c3df61/beacon_node/network/src/sync/manager.rs#L1013) removes the request from the sync manager. later, however if there's an error, `inject_error` [here](https://github.com/sigp/lighthouse/blob/1cb02a13a53d0e603ad5920c03832e5779c3df61/beacon_node/network/src/sync/manager.rs#L1055) expects the request to exist so we can handle retry logic. So this PR just re-inserts the request (withthout any accumulated blobs or blocks) when we hit an error here. The issue is unique to block+blob sync because the error here is only possible from mismatches between blocks + blobs after we've downloaded both, there's no equivalent error in block sync Co-authored-by: realbigsean <seananderson33@gmail.com>
Pull request successfully merged into unstable. Build succeeded! The publicly hosted instance of bors-ng is deprecated and will go away soon. If you want to self-host your own instance, instructions are here. If you want to switch to GitHub's built-in merge queue, visit their help page.
|
Issue Addressed
I observed our forward sync on devnet 9 would stall when we would hit this log:
Proposed Changes
range_sync_block_and_blob_response
here removes the request from the sync manager. later, however if there's an error,inject_error
here expects the request to exist so we can handle retry logic. So this PR just re-inserts the request (withthout any accumulated blobs or blocks) when we hit an error here.The issue is unique to block+blob sync because the error here is only possible from mismatches between blocks + blobs after we've downloaded both, there's no equivalent error in block sync