Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Meta Issue: Fixing high impact correctness and performance problems in ETH RPC API for snapshot synced nodes #12293

Closed
aarshkshah1992 opened this issue Jul 24, 2024 · 10 comments
Labels
Milestone

Comments

@aarshkshah1992
Copy link
Contributor

aarshkshah1992 commented Jul 24, 2024

This issue aims to be a meta-issue to capture and track work that needs to be done to enhance correctness, performance, and stability of the ETH RPC API on snapshot synced nodes. Note that improving performance for ETH RPC API on archival nodes is out of scope for this issue and will be addressed by a future issue.

Our goal is to improve the developer experience (DX) for key partners, including:

  1. ETH Subgraph Providers (e.g., Vulcanize)
  2. EVM Explorers (e.g., Blockscout)
  3. Cross-chain Bridges (e.g., Axelar)
  4. DApp Developers using existing ETH tooling which relies on ETH RPC API

Correctness and data availability issues in the chain state Indexes used by the ETH RPC API

Currently, we maintain three primary indices on the chain state, which are essential for both correctness and performance of multiple ETH RPC APIs.

Transaction Index

  • Maps an ETH transaction hash to a Filecoin message CID.
  • This mapping is one-way; if it's not in the index, there's no way to go from an ETH transaction hash to a Filecoin message CID.
  • APIs depend on this index to lookup the Filecoin message CID for a given ETH transaction hash, and then lookup the message by CID to get the actual transaction data.
  • If an entry is not found in this Index for a given transaction hash, the subsequent lookup for the message will be slow and likely fail if the query pertains to an ETH transaction.

Message Index

  • Maps a Message("transaction") to it's tipset and block number
  • Heavily used by ETH RPC to lookup the tipset containing a given message CID to get the full message/receipts
  • A miss on this Index leads to a painfully slow search through the state store for the requested message

Event Index

  • Associates each tipset/block with the event logs for all transactions contained within.
  • This index is crucial for the functionality of the ETH Event APIs, which are extensively utilized by subgraph providers and bridges.
  • Issues with data availability in this index can lead to costly recomputation of tipsets to regenerate events or result in the Events API failing to return events that should be available—assuming the corresponding tipset is present in the chain state, these events should be reliably indexed.

All of the above indices suffer from some or all of the following problems that need to be fixed:

  • They're not reset and rehydrated when a node syncs from a snapshot
  • They're not automatically backfilled on node startup
  • The lotus-shed backfilling CLI that users rely on for manually backfilling the indices is broken as all the Indices are persisted in Sqlite and Sqllite only supports a single writer. This effectively means that backfilling races with indexing new/ongoing state transitions
  • Indexing is done asynchronously to tipset/message execution but APIs that rely on these indices do not account for the async nature of indexing which leads to racy data avability issues for lookups at/near the chain head
  • They are not subject to garbage collection when the Splitstore is GC'd, which means they retain data that will never be accessed and the indices keep growing in size. Although the space used by these indices is minimal compared to the overall chainstate, the accumulation of unnecessary data could potentially slow down index queries.
  • Despite being operational for over a year, features like the Message Index are still labeled as "EXPERIMENTAL" which cause confusion among Node operators.
  • Better logging and telemetry around index usage and backfilling.
  • Consistency checks and auto-repair mechanisms in the Indices to 1) Flag missing data in the indices for epochs that have achieved finality and 2) Lazy backfilling of missing data in the indices on user requests.

Correctness problems in the ETH Events API

In-memory block caching for perf improvements

Multiple ETH RPC APIs frequently need to lookup Filecoin Tipsets and convert them to the correspondong Ethereum block representations. These lookups are performed on the chainstore which is expensive. We should cache these tipsets/blocks in an LRU cache. See #10520.

Miscellaneous correctness bugs from the backlog

@aarshkshah1992 aarshkshah1992 added the kind/feature Kind: Feature label Jul 24, 2024
@aarshkshah1992 aarshkshah1992 changed the title Correctness and Performance problems in ETH RPC: Meta Issue Meta Issue: Fixing high impact correctness and performance problems in ETH RPC Jul 24, 2024
@aarshkshah1992 aarshkshah1992 changed the title Meta Issue: Fixing high impact correctness and performance problems in ETH RPC Meta Issue: Fixing high impact correctness and performance problems in ETH RPC API for snapshot synced nodes Jul 24, 2024
@snissn
Copy link
Contributor

snissn commented Jul 24, 2024

For any and all of these that have ways to reproduce it would be great to coordinate and add the RPC call behind any of these issues to the RPC benchmark tool that Fil-B has been maintaining - https://github.com/fil-builders/benchmark-rpc/blob/main/pages/index.js#L21

For example this issue #10940 should be easy to reproduce in a live test. I created a ticket for it FIL-Builders/benchmark-rpc#1 to add it to the web app http://benchmark-rpc.fil.builders/

@rvagg
Copy link
Member

rvagg commented Jul 25, 2024

They're not reset and rehydrated when a node syncs from a snapshot

msgindex is @

if err := index.PopulateAfterSnapshot(ctx, filepath.Join(basePath, index.DefaultDbFilename), cst); err != nil {

It at least has a pattern we can follow for others. But it also overlaps with a backfill operation, so we may end up taking care of snapshot import with a general backfill routine if we get that right.

@rjan90 rjan90 added this to FilOz Jul 25, 2024
@rjan90 rjan90 moved this to 🐱Todo in FilOz Jul 25, 2024
@rjan90 rjan90 moved this from 🐱Todo to 📌 Triage in FilOz Jul 25, 2024
@Stebalien
Copy link
Member

This list seems pretty complete. IMO, the highest priority is fixing the indexing issues:

  1. P0: Make sure that we backfill to the last indexed tipset on restart. We should never have "holes".
  2. P1: Make sure that we wait for the index in some cases, or even eagerly force it. E.g., as we discussed on the call, all of the EthGetBlockBy* commands should force the indexer to index that block (and its parents).

Indexing is done asynchronously to tipset/message execution but APIs that rely on these indices do not account for the async nature of indexing which leads to racy data avability issues for lookups at/near the chain head

IMO, the best way to handle this case is the dance we discussed on the call:

  1. Check the index. If present, return.
  2. Wait for the index to reach the current head.
  3. Check again.

This will miss uncles, but StateSearchMsg is designed to only find messages on the main chain.


I took a look at how geth handles stuff like this and... they also appear to index asynchronously and handle this case by returning an error if the node is currently indexing a block. That's not a terrible option... but it would be a larger breaking change.

@rjan90 rjan90 added this to the DX-Streamline milestone Aug 2, 2024
@BigLep
Copy link
Member

BigLep commented Aug 6, 2024

This is a great overview @aarshkshah1992 - thanks for writing it up. A few questions, some of which are coming from a newbie/ignorant-of-the-code perspective. I'm happy to chat on any of these elsewhere or offline, but figured to ask here so it's public.

  1. Can any of the the ETH RPC code simplify when we have fast finality (F3)? If so, given we’re so close to that being live on the network (less than 2 months), should we do work here with that assumption in mind?
  2. Assuming we "make sure that we backfill to the last indexed tipset on restart so we never have holes" and we "make sure that we wait for the index" can we simplify the code failing fast if we have an index miss? I'm eyeing your statements in the issue description about “the subsequent lookup for the message will be slow and likely fail if the query pertains to an ETH transaction” and “A miss on this Index leads to a painfully slow search through the state store for the requested message”. It's presumably not good to be in this state, but would it better off to fail fast then do expensive state traversals? (I assume private Lotus RPC methods can be invoked if someone needs the slow path.)
  3. Is there value in having separate db's? (I don't have insight here - just curious.). I see there is an issue about consolidating in Enable efficient indexing of historical chain data #10807 . (Maybe it's best to discuss this question there.)

@aarshkshah1992
Copy link
Contributor Author

@BigLep

  1. I don't think there's any work here that's bound to a certain notion of finality and so not sure if F3 changes anything here in terms of the work we need to do on ETH RPC/Chain state indexing.
  2. I don't think there is any value in having separate DBs but @Stebalien can confirm the team's line of thinking when this was implemented.

Let me look into 2.

@Stebalien
Copy link
Member

I don't think there's any work here that's bound to a certain notion of finality and so not sure if F3 changes anything here in terms of the work we need to do on ETH RPC/Chain state indexing.

That depends on what we do with the API. If F3 is "fast enough", we could just not expose anything after finality. But... that's probably not going to work well.

I don't think there is any value in having separate DBs but @Stebalien can confirm the team's line of thinking when this was implemented.

I agree there's no reason to keep them separate.

@aarshkshah1992
Copy link
Contributor Author

aarshkshah1992 commented Aug 8, 2024

@Stebalien @BigLep @rvagg

In the first pass of this work, we're not going to work on merging the DBs for these as that is a larger refactor and will need a non-trivial migration for users and we've not estimated it yet.

Let's get to it once we've fixed all the other problems here.

@BigLep
Copy link
Member

BigLep commented Aug 22, 2024

In the first pass of this work, we're not going to work on merging the DBs for these as that is a larger refactor and will need a non-trivial migration for users and we've not estimated it yet.

Let's get to it once we've fixed all the other problems here.

For visibility, it was decided that it would be useful to merge the DBs into a single DB. The work is happening in #12421

@BigLep
Copy link
Member

BigLep commented Jan 8, 2025

@rvagg : how much of this is still relevant? Can we close out any of the linked items?

@rvagg
Copy link
Member

rvagg commented Jan 9, 2025

We're not actively getting complaints or input on performance related to these so I think we can close and open new issues as things arise.

@rvagg rvagg closed this as completed Jan 9, 2025
@github-project-automation github-project-automation bot moved this from ⌨️ In Progress to 🎉 Done in FilOz Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: 🎉 Done
Development

No branches or pull requests

6 participants