feat(multicodecs): add filecoin multicodecs #161
add serialization and hashing codecs for filecoin
table.csv
@@ -429,3 +429,8 @@ holochain-key-v0, holochain, 0x947124, Holochain v0 pub
holochain-key-v1, holochain, 0x957124, Holochain v1 public key + 8 R-S (63 x Base-32)
holochain-sig-v0, holochain, 0xa27124, Holochain v0 signature + 8 R-S (63 x Base-32)
holochain-sig-v1, holochain, 0xa37124, Holochain v1 signature + 8 R-S (63 x Base-32)
fil-piece-unsealed, serialization, 0xfi01, Filecoin piece, raw data (CID = Piece Commitment)
What's the format of this data? CAR?
Should this be IPLD?
It's neither. The CID here is the root of the Merkle tree constructed from the bytes that make up the CAR.
table.csv
fil-sector-unsealed, serialization, 0xfi02, Filecoin sector, raw data (CID = Data commitment)
fil-sector-sealed, serialization, 0xfi03, Filecoin sector, sealed and replicated (CID = Replication Commitment)
fil-hash-unsealed, multihash, 0xfi04, Filecoin unsealed commitment hash (custom hashing alg)
fil-hash-sealed, multihash, 0xfi05, Filecoin sealed commitment hash (custom hashing alg)
What's the actual hashing algorithm? Or is this some kind of special merkle-tree based hash?
According to @porcuquine, it is indeed custom. He describes it as "SHA256 with a twist" -- different enough that he says it shouldn't just be listed as SHA-256.
What are the size constraints on these codes? I.e., what goes on chain?
The size of CommP at least is 32 bytes.
But @jbenet would like it to be an actual CID -- as in:
I'd ask everyone to pretend I didn't just act like the letter 'i' is a valid hex character.
fil-piece-unsealed, serialization, 0xf101, Filecoin piece- raw data (CID = Piece Commitment)
fil-sector-unsealed, serialization, 0xf102, Filecoin sector- raw data (CID = Data commitment)
fil-sector-sealed, serialization, 0xf103, Filecoin sector- sealed and replicated (CID = Replication Commitment)
Is documentation available on the structure of all of these? Are they really unique formats, or just blobs of bytes? Multicodecs are for formats, not identifiers.
@hannahhoward the piece commitment and the data commitment should be the same, so we only need one of those.
To be able to resolve those nodes the codec should indicate the depth of the graph, otherwise you can't know if you have an intermediate node or a leaf. For a 32G sector this means we'd need codecs for 30 levels (each node is 32 bytes), and possibly more levels for a bit of future-proofing (64 for a nice round number?)
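As a rough check on the 30-level figure above, here is a minimal sketch. It assumes "32G" means 32 GiB, a binary tree, and 32-byte nodes; the function name `merkle_depth` is made up for illustration.

```python
def merkle_depth(sector_bytes: int, node_bytes: int = 32) -> int:
    """Number of hashing levels between the leaves and the root of a binary Merkle tree."""
    leaves, remainder = divmod(sector_bytes, node_bytes)
    # A complete binary tree needs a power-of-two number of leaves.
    if remainder or leaves < 1 or leaves & (leaves - 1):
        raise ValueError("sector must hold a power-of-two number of nodes")
    return leaves.bit_length() - 1


print(merkle_depth(32 * 2**30))  # 32 GiB sector, 32-byte nodes: prints 30
```

A 64 GiB sector would add one more level under the same assumptions, which is where the suggestion of reserving extra levels comes from.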
I'd really rather not abuse codecs that way. I'm going to try to think up a better approach.
Edit: unnecessary detail in here, will add a new comment below with pertinent details.
This is related to a problem I'm having right now doing CID stuff for blockchain data in that we're addressing merkle nodes and not an entire block that we can hash. I'd really like to have a consistent approach to this that we can reuse.
The CommP as a "CID" is really just a CID for a 64-byte string that's a concatenation of two hashes. It's plain SHA2-256 over that. So it could be locked in as that. You could, if you wanted to, make a CID using the same coding for each node in the CommP merkle, all the way down to the base 32-byte hashes of the underlying data.
What I'm trying to solve for right now in serializing BTC blocks is getting from the header to the transactions by way of a 2-ary merkle tree where the root is all the header has. go-ipld-btc takes the interesting approach of overloading the
It seems that we need a more consistent way of addressing merkle structures with CIDs because they pop up so frequently. But in the meantime, saying that
I think you know this, and it's tangential to what you are discussing, but I want to remind you that the CommP/CommD hash is not just SHA2-256. It's SHA2-256, with truncation to 254 bits (which I'm intentionally not fully specifying here). I mention it mainly so the description above isn't taken out of context as a specification by some future reader.
Edit: unnecessary detail in here, will add a new comment below with pertinent details.
@porcuquine
But that's only at the base of the merkle tree in the underlying data, for the purpose of
Edit: it does mean that when you get to the base of the merkle tree, if you were addressing nodes by this CID style, you'd be addressing something different, not the concatenation of two hashes, but the underlying data with 2-bit spacing at 254-bit intervals, which is weird, but the merkle tree doesn't care.
There's currently no mechanism to say that "this CID addresses all of this underlying data", it only addresses an immediate "block" which can be passed through some hashing algorithm. i.e. it breaks down if you want to use a CID to address anything beyond a node in a merkle tree.
But for the purpose of https://github.com/filecoin-project/go-fil-commcid, simplifying it to say that's all it is is probably enough, it's just not very satisfying when you push the
I'm not sure if we are talking past each other or not. In Filecoin, the Merkle trees whose roots are called either CommP or CommR are composed of binary hashes which are as I described, not just SHA2-256 without the truncation. This is important because each node in the tree has to fit into one BLS12-381 field element when included as a private input to the Merkle inclusion proofs contained in the PoRep SNARK.
Edit: unnecessary detail in here, will add a new comment below with pertinent details.
@porcuquine oh yeah, sorry about that .. https://github.com/filecoin-project/rust-fil-proofs/blob/d7896c29ef3c0cc8c04f9fab7ef434e6691ed480/storage-proofs/core/src/hasher/sha256.rs#L75 and I even implemented that in javascript so should have remembered! https://github.com/rvagg/js-fil-utils/blob/52255d1603ac49d912d8a1ede66bab5c0baa228b/merkle.js#L10
Well that complicates it further then. Hash algorithm is something like "sha2-256-254" which maybe needs a multihash entry of its own to make this work.
Edit: unnecessary detail in here, will add a new comment below with pertinent details.
... and that's what this PR is for, the sha2-256-254 _multihash_, with the _multicodec_ for CommP currently being ["raw"](https://github.com/filecoin-project/go-fil-commcid/blob/2b8bd03caca59c436d953f32fd825c055b612e18/commcid.go#L69) which probably isn't right and may need its own entry.
I guess for sealed data, describing it as sha2-256-254 doesn't cut it since those extra 2 bits are going to be filled with something that we can't specify without additional context? For unsealed we can at least say
Hence my mild panic.
I'm not 100% sure what you're discussing, so I'll be guarded. SHA2-256 doesn't contribute to any root of sealed data. The only Merkle trees created on sealed data use Poseidon with a more complicated tree structure.
Sorry for my spam in here @hannahhoward et al. I've
@hannahhoward one thing I would like to know is whether this PR or https://github.com/filecoin-project/go-fil-commcid/blob/master/commcid.go reflects current desired state. The latter uses
(Skip down to "Suggestions" at the bottom of this if you ain't got time for all this text, scroll back up for background.)
We're trying to extend CID & multiformats a bit beyond what they should be used for here I think, but the edges are blurry already so let's be clear before proceeding. Please excuse my basic framing here, I know it's probably tedious for some.
What are these values we're trying to identify:
CIDs and multihashes
A CID is:
It's for content-addressed data where
(This is being baked into the IETF standard submission for multihash @ https://datatracker.ietf.org/doc/draft-snell-multihash/)
To date, as far as I'm aware, all of the
Multihash & this PR
As it stands, I believe (my reading could be faulty!) this PR is proposing to extend the multihash concept and say that:
Compounding this, I believe the sealed "hashing" process would use additional context external to the algorithm (for the ZK proof) making it impossible, or impractical for anyone to take
You could take a
CID & this PR
As per the definition above, a CID is a
So considering this
What we need to resolve
For multihash:
e.g. for CommP:
For CID / multicodec:
This PR uses
Suggestions
We have 2 get-out-of-jail tools here that we can fall back on if needed:
I think "well-established cryptographic hash functions" disqualifies the addition of the new multihashes. Both "well-established" and "hash functions" trip us up. I think even that
At this stage I think the multihash for these things, if one is required (i.e. to make a CID), should be
For the other part, the "multicodec", it doesn't seem as clear either way. I don't think
We've already set the precedent that
Any requirement for the ability to differentiate algorithm in the future can iterate in the
Example of a CommP to a CID using this:
If we have something that's not a
Quick notes -- not done but out of time, will return later on --
So we need to address the flexibility of the definition of "multihash" first. There's 3 approaches:
Having talked it through with @porcuquine, I've put up #170 and #171 to cover approach 2. I buy Poseidon as a valid multihash entry, it's just got a lot of possible permutations so we need to make sure we fit just enough (but not too much) information into the entries to sufficiently differentiate them.
If we accept those as valid extensions to multihash, then #172 could wrap them up as CIDs, borrowing the first 3 entries from this PR. But we'd have to be clear that these identifiers, in the
Summary of actions to get this closed out, please skim this and register any disagreement now @Stebalien @jbenet @whyrusleeping @porcuquine @mikeal @magik6k @dignifiedquire:
that works for me 👍 thank you very much for taking this on @rvagg -- i know this is really tricky work, with lots of ramifications, and i really appreciate the care and thoughtfulness you put into the whole thread of things
Amen.
We had a bit of a diversion in #172 over the nature of the "codecs", thanks to @hannahhoward for pinging us on that. My original attempt to shoehorn these in as IPLD-like codecs was a failure, but @vmx and I are mostly comfortable with moving forward with simply adding them as identifier descriptors. So the multihash says "what hash function was involved in generating this content address" and the codec says "what type of content is this addressing" while stopping a bit short of saying "how can I decode the data if I were to fetch it", which is where we fail at these being true IPLD-like codecs. We get to differentiate between sealed and unsealed addresses (CIDs), but you couldn't practically traverse into them like an IPLD codec. The tag stays as
We also had a good discussion about qualification for this table and how we need a clearer description of what gets to be in here and what doesn't. The purpose needs to be more clearly stated so contributions to the table are easier to approve or reject. There also needs to be some room for contributors to be able to justify inclusion for their own reasons if there isn't an obvious disqualification according to the purpose of the table. In the case of Filecoin, we still don't fully understand why CIDs would be useful for each of the 3 cases, or where they might get used outside of the system for them to need an entry in here. If someone could explain that it would be interesting for us. But grokking all of these details is going to be too great a burden for maintenance of this table in general (for just these entries it's taken many collective hours of learning; this was positive for both @vmx and me but won't scale).
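The split described above (the multihash names the hash function, the codec names the kind of content) maps onto the standard CIDv1 byte layout: version, codec, then a multihash of hash code, digest length and digest, each integer written as an unsigned varint. A minimal sketch follows. `0xf101` is the code from the revised `fil-piece-unsealed` row in this PR; the multihash code `0x1012` is an assumption for illustration only, since no final value is stated in this thread, and the all-zero digest is a placeholder rather than a real commitment.

```python
def uvarint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint (7 bits per byte, low bits first)."""
    if n < 0:
        raise ValueError("uvarint needs a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def cid_v1(codec: int, hash_code: int, digest: bytes) -> bytes:
    """Binary CIDv1: <version=1><codec><multihash>, multihash = <hash code><length><digest>."""
    multihash = uvarint(hash_code) + uvarint(len(digest)) + digest
    return uvarint(1) + uvarint(codec) + multihash


UNSEALED_CODEC = 0xF101     # from the revised table rows in this PR
ASSUMED_HASH_CODE = 0x1012  # assumption: illustrative multihash code, not taken from this thread
placeholder_commitment = bytes(32)  # stand-in for a 32-byte commitment

cid = cid_v1(UNSEALED_CODEC, ASSUMED_HASH_CODE, placeholder_commitment)
print(cid.hex())
```

Nothing in this layout tells a reader how to decode or traverse the addressed data, which is the limitation noted above.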
SHA2-256 with the trailing 2 bits zeroed out. Primary current use is Filecoin. Ref: #161
Reserving the 0xb400 range for Poseidon variants, allowing FIL to iterate on the `fcX` extension of the name where they stay with BLS12-381 and arity=2. High security variant is for extra circuits that are usable in case new attacks arise from the standard variant. Ref: #161 Ref: https://eprint.iacr.org/2019/458.pdf
These describe roots & nodes of a merkle tree, not the underlying data. In the case of CommP and CommD they are binary merkle trees using sha2-256-trunc2. For CommR they are novel structure merkle trees using poseidon-bls12_381-a2-fc1. All nodes of the respective merkle trees could also be described using this codec if required, all the way to base data. It is anticipated that the primary use will be restricted to the roots. Ref: #161 Closes: #161 Closes: #167
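A minimal sketch of how a root over sha2-256-trunc2 nodes could be computed, under stated assumptions: the leaves are already 32-byte nodes (any padding of the raw data is out of scope here), "trailing 2 bits" is read as the two most significant bits of the last byte, as in the rust-fil-proofs code linked earlier in the thread, and the helper names are invented for this example. It is not a specification of CommP or CommD.

```python
import hashlib

NODE_SIZE = 32


def trunc254(digest: bytes) -> bytes:
    """Zero the top two bits of the last byte so the value fits in 254 bits (assumed bit order)."""
    out = bytearray(digest)
    out[31] &= 0b0011_1111
    return bytes(out)


def hash_pair(left: bytes, right: bytes) -> bytes:
    """Parent node: truncated SHA2-256 over the 64-byte concatenation of two child nodes."""
    return trunc254(hashlib.sha256(left + right).digest())


def merkle_root(leaves: list) -> bytes:
    """Root of a binary Merkle tree over a power-of-two number of 32-byte leaves."""
    if not leaves or len(leaves) & (len(leaves) - 1):
        raise ValueError("need a power-of-two number of leaves")
    if any(len(leaf) != NODE_SIZE for leaf in leaves):
        raise ValueError("every leaf must be 32 bytes")
    level = list(leaves)
    while len(level) > 1:
        # Pair adjacent nodes to build the next level up.
        level = [hash_pair(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


leaves = [bytes([i]) * NODE_SIZE for i in range(4)]  # toy data, not real sector contents
root = merkle_root(leaves)
print(root.hex())
```

Every intermediate node produced this way is itself a 32-byte value with the same two bits cleared, so the same codec could describe any node in the tree, not only the root.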
* sha2-256-trunc254-padded
* poseidon-bls12_381-a2-fc1

Ref: multiformats/multicodec#161
Ref: multiformats/multicodec#171
Ref: multiformats/multicodec#170
Goals
Filecoin needs to be able to construct CIDs for:
These are used to uniquely identify:
According to @porcuquine:
Filecoin uses custom hashing
Hashing is different for sealed and unsealed data
Implementation
I have defined three serialization codecs (one for each of the three types of data) and two hashing algorithms.
Assuming these changes are accepted, I will make changes to go-multihash and go-cid.