Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: return error on LookupBySegment failure #524

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Jun 13, 2023

Not quite sure what to do with this yet. This probably isn't the "fix", but I need to get this up before I forget the details of how I landed here.

Testing out selectors for our Trusted Gateway walks, I happen to have a DAG where I know it's incomplete—the wikipedia /wiki/ directory in fact, I don't have many of the pages and I'm expecting an error to be propagated out of the traversal through the sharded directory where it finds it can't load a block. The error occurs in go-unixfsnode/hamt/shardeddir.go's LookupBySegment(), and it's a not-found error from the blockstore, but it gets to here and there's a continue. i.e. the selector has an interest in this path segment, but it doesn't happen to be in the node in question.

So, our current Trusted Gateway validator (in Lassie) doesn't even reject the malformed DAG because of this. Where I expect a 5 block DAG, but give it a 3, it'll still pass as if everything's fine.

This goes back to the code that started using Interests() here: 31c9bb7

And it kind of aligns with the stated intention of Interests() as a hint, rather than a directive that it must be explored:

// S1. First, the traversal asks the Selector what its "interests" are.
// This lets the Selector hint to the traversal process what it should load,
// which can be important for performance if not all of the next data elements are in memory already.
// (This is applicable to ADLs which contain large sharded data, for example.)
// (The "interests" phase should be _fast_; more complicated checks, and anything that actually looks at the children, should wait until the "explore" phase;
// in fact, for this reason, the `Interests` function doesn't even get to look at the data at all yet.)
// T2. The traversal looks at the Node and its actual fields, and what the Selector just said are interesting,
// and between the two of them figures out what's actually here to act on.
// (Note that the Selector can say that certain paths are interesting, and that path can then not be there.)
// S2. Second, the code driving the traversal will ask us to "explore", **stepwise**.
// The "explore" step is applied **repeatedly**: once per pathSegment that identifies a child in the Node.
// (If `Interests()` returned a list, `Explore` will be called for each element in the list (as long as that pathSegment actually existed in the Node, of course);
// or if `Interest()` returned no guidance, `Explore` will be called for everything in the object.)

@rvagg rvagg requested review from willscott and hannahhoward June 13, 2023 12:00
@rvagg
Copy link
Member Author

rvagg commented Jun 13, 2023

The problem here exists in the nuance between "I don't have that field" (jedi hand-wave) and "I have the field but there was an error loading it".

We do have a mechanism for slicing this salami though, datamodel.ErrNotExists is an obvious error to check for, and if it's not, then return the error, otherwise continue. The problem is that this is not universally used. For instance we also have a schema.ErrNoSuchField that's used for typed nodes, go-unixfsnode will return this for the case of "I don't have that field" in fact. So that kind of sucks.

We could tackle this by doing an error inheritance, or custom Is(), or checking for both error types and then strictly stating that a contract for LookupByX() should return this error type, so every implementation (basic, typed, ADL) has to conform to get this behaviour. Not a small bugfix though!

@hannahhoward
Copy link
Collaborator

@rvagg rather than try to ignore the errors when it's not a "loading error", I think what we want is to capture the specific errors that come from loading errors. I think that means wrapping the errors in loadLink (

func (prog Progress) loadLink(lnk datamodel.Link, v datamodel.Node, parent datamodel.Node) (datamodel.Node, error) {
) with something easily identifiable. That code seems much closer and easier to cover comprehensively.

@rvagg
Copy link
Member Author

rvagg commented Jun 13, 2023

loadLink isn't used by ADLs that load their own nodes, in the particular case I'm hitting of pathing through a sharded directory, the link loading happens in go-unixfsnode and the error comes directly in through LookupBySegment, the last link load in the selector code is the block that needs to be reified into a sharded directory.

https://github.com/ipfs/go-unixfsnode/blob/1f8efbad2b0d2820b468b62e720d1b257c40db65/hamt/shardeddir.go#L152-L155

@rvagg
Copy link
Member Author

rvagg commented Jun 14, 2023

Some next steps here after a brief discussion:

  1. We should be able to handle this problem locally where we care about it (i.e. in our validation and other places we have control and can, and want to, signal errors) by wrapping the linksystem used for traversal. Since refiers are passed a linksystem, any failed loads should be signalled through that so we get a side-channel to block load errors.
  2. Investigate the possibility of an ErrNotExists contract, around LookupBy*() methods. At least something we can errors.Is() on. Alternatively we have a list of acceptable error types here to ignore, including currently used types. Any other error type will trigger a fatal in a traversal.

Hannah says that graphsync already does something like (1) above and should be aware of bad blocks in a DAG already, regardless of what the traversal does. That means the blast radius of this problem is less than it might otherwise be.

rvagg added a commit to filecoin-project/lassie that referenced this pull request Jun 29, 2023
* don't treat context cancellation as an ErrExtraneousBlock in the CAR verifier
* capture and properly handle block load errors that are missed by the
  go-ipld-prime traverser, ref: ipld/go-ipld-prime#524
* fix flaky case(s) in verifiedcar test suite where multi-level sharded
  directory is assumed but only a single block dir is produced
rvagg added a commit to filecoin-project/lassie that referenced this pull request Jun 29, 2023
Closes: #336

* don't treat context cancellation as an ErrExtraneousBlock in the CAR verifier
* capture and properly handle block load errors that are missed by the
  go-ipld-prime traverser, ref: ipld/go-ipld-prime#524
* fix flaky case(s) in verifiedcar test suite where multi-level sharded
  directory is assumed but only a single block dir is produced
rvagg added a commit to filecoin-project/lassie that referenced this pull request Jun 29, 2023
Closes: #336

* don't treat context cancellation as an ErrExtraneousBlock in the CAR verifier
* capture and properly handle block load errors that are missed by the
  go-ipld-prime traverser, ref: ipld/go-ipld-prime#524
* fix flaky case(s) in verifiedcar test suite where multi-level sharded
  directory is assumed but only a single block dir is produced
@rvagg
Copy link
Member Author

rvagg commented Jun 29, 2023

Workaround in lassie for the two main traversal executions: filecoin-project/lassie#338 - watch for load errors on the linksystem and check for them after we've got no errors from the traversal.

rvagg added a commit to filecoin-project/lassie that referenced this pull request Jul 3, 2023
Closes: #336

* don't treat context cancellation as an ErrExtraneousBlock in the CAR verifier
* capture and properly handle block load errors that are missed by the
  go-ipld-prime traverser, ref: ipld/go-ipld-prime#524
* fix flaky case(s) in verifiedcar test suite where multi-level sharded
  directory is assumed but only a single block dir is produced
rvagg added a commit to filecoin-project/lassie that referenced this pull request Jul 4, 2023
Closes: #336

* don't treat context cancellation as an ErrExtraneousBlock in the CAR verifier
* capture and properly handle block load errors that are missed by the
  go-ipld-prime traverser, ref: ipld/go-ipld-prime#524
* fix flaky case(s) in verifiedcar test suite where multi-level sharded
  directory is assumed but only a single block dir is produced
rvagg added a commit to filecoin-project/lassie that referenced this pull request Jul 4, 2023
Closes: #336

* don't treat context cancellation as an ErrExtraneousBlock in the CAR verifier
* capture and properly handle block load errors that are missed by the
  go-ipld-prime traverser, ref: ipld/go-ipld-prime#524
* fix flaky case(s) in verifiedcar test suite where multi-level sharded
  directory is assumed but only a single block dir is produced
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants