DealProposal's Label field serializes as non-UTF8 #1248

ec2 · 2020-10-13T16:48:47Z

The Label field in DealProposal is a string that is not guaranteed to be UTF8 as required in the CBOR spec. This causes issues for other client implementations (like us). It will take a LOT of effort for us to work around this bug.

I would suggest somehow enforcing UTF-8 or switching the field to be arbitrary bytes instead.

The text was updated successfully, but these errors were encountered:

anorth · 2020-10-18T23:04:30Z

I haven't investigated deeply, but it sounds like the cbor-gen generated code is permitting non-UTF8 bytestrings in this field.

cc @ZenGround0 @rvagg

rvagg · 2020-10-19T02:41:05Z

I think @willscott might be on the bleeding edge of this one with this work. Also /cc @warpfork @ribasushi as we've collectively been circling around these concerns since Filecoin has been mildly pushing at the definition of "strings" in some situations (thanks mainly to the simple Go mapping of string:[]byte) that introduce some cross-impl difficulties. And /cc @vmx for the Rust -- see linked ChainSafe issue which relates to what you've been working on.

I'll put this on the agenda for our IPLD meeting tomorrow https://github.com/ipld/team-mgmt/#weekly-call although we can't resolve Filecoin concerns there, but @willscott et. al. might have insights that help us suggest a way forward or at least comment further and maybe document some of this at least.

ribasushi · 2020-10-19T02:49:39Z

Do note that Label is designed to take user-input essentially directly. Adding any type of enforcement in the path populating this field would be... difficult. For comparison the Send params are declared []byte https://github.com/filecoin-project/lotus/blob/v0.10.2/chain/types/message.go#L43

mikeal · 2020-10-19T22:04:56Z

We talked about this in the IPLD team call.

We recommend moving to a byte value instead of a string. As you can see, encoding invalid UTF8 in string values will break in some languages. This is a language cross-compatibility problem for which there is no good solution other than “don’t encode non-UTF8 data in strings when you’re in languages that allow you to have invalid UTF8 in your strings.”
Given that these are already on-chain, there’s a bit of an open question about how to best manage dealing with the existing data. The solution is going to vary quite a bit by language and toolchain. @vmx may have more to say about the best way to deal with this in Rust where we’re currently seeing it. This is less of an issue in Go but once you migrate to a new value type handling the variation in value types is something @warpfork is integrating into go-ipld-prime schemas. JS shouldn’t have any problems, strings are UTF16 and can have invalid characters, and since it’s not strongly typed there’s no concern handling a value type variation in IPLD.

anorth · 2020-10-22T02:34:26Z

Ok, representing this as []byte sounds good.

We can't make this change until we do a major-version upgrade whereupon we can change state schema. At that point we can also migrate all deals in the chain state to encode as []byte instead of string. But nothing can change the fact that prior to that migration epoch they will be encoded as strings.

Changing might also be a pain for the markets modules and other software that deals in deals, which will have to handle pre-migration and post-migration values distinctly. We can't do this without buy-in from the teams who will wear that pain on the currently-in-use software. cc @hannahhoward @arajasek @whyrusleeping

austinabell · 2021-01-18T20:35:34Z

what is the reason for this being removed from v3 milestone @ZenGround0 ?

anorth · 2021-01-19T23:19:26Z

I'm not sure of Zen's explicit reason, but the focus of v3 (and likely v4 too) is chain efficiency, attempting to remove the bottleneck that is the chain validation throughput (i.e. gas usage). It's a rather pressing problem until we can improve throughput to match miner sealing hardware.

rvagg · 2021-01-20T03:02:43Z

To add some more weight for getting this changed - it turns out that it's as difficult in JavaScript as it is in Rust to deal with this. We currently have no way to reliably round-trip these blocks in our JavaScript stack since the standard bytes->string conversion methods available all deal with invalid UTF-8 characters by inserting U+FFFD. We can't really even reliably make use of the data if we decode it straight.

We're talking about strategies for doing a reach-around for special cases like this but there's no clear way to deal with it as the standard case (which is why we've hardened up on the SHOULD for Strings as UTF-8 in the IPLD Data Model https://specs.ipld.io/data-model-layer/data-model.html#kinds-reference). But it would be a special case thing where you know ahead of time that you're dealing with a block that has dodgy strings in it so you'd either explicitly ask to have strings dealt with as bytes, or request an alternative decoding scheme that can get you closer to the bytes you care about, such as latin1.

austinabell · 2021-01-20T13:09:49Z

I'll just add my motivation for this, aside from the fact that it's invalid Cbor based on the spec (https://tools.ietf.org/html/rfc7049#section-2.1). The main reason is that we can't validate that any string that gets decoded from Cbor is valid utf-8 and can introduce bugs in the future.

I get that the transition and the removal of unsafely decoding strings cannot happen soon now that it's baked into the v0 and v2 actors, but would be nice to remove ASAP so it hopefully won't be a permanent problem in the future. If this gets changed, it allows us to validate the whole protocol does not include this. This was the only occurrence of invalid utf8 when it was introduced, and the value is never being used.

These interactions may not be a big deal for the go implementations since the behaviour might be expected or deterministic, but causes issues for other implementations to match. Also, external tooling or interfacing of this data may cause inconsistencies (for example through JSON because it also only supports unicode)

Having said this, I don't have a problem if this doesn't change and understand why it wouldn't, just wanted to say my piece for why I think the change should be made.

ZenGround0 · 2021-04-07T17:22:35Z

While not critical to do right now I would like to get this scheduled as a fix for v4.

I need some help understanding how extensive a change this is on the lotus side. @magik6k @arajasek could you estimate how much effort will need to go into supporting a change to the on chain deal proposal format? Relatedly are proposal cids used by the markets code extensively?

laudiacay · 2021-09-24T16:45:42Z

So was the decision just slinging around bytes, or erroring if it's non-UTF8?

ZenGround0 · 2021-09-24T16:52:26Z

~~The decision is still yet to be made but as long as it is feasible I'd like to filter non-UTF8 strings from the deal label field. Probably as part of deal validation.~~

The second part of this is remove non-UTF8 deal labels from state entirely. It's definitely easiest to just let the proposals with this property expire, but we could also do a migration to zero out these labels, or do some sort of reduction down to UTF8.

Rereading this issue properly and the consensus to encode as bytes is better than what I just said. This has the nice property of all deal labels being unambiguously convertible in a state migration.

--edit--

The fact that the proposal signature never makes it on chain makes changing the serialization to cbor byte string acceptable imo. But this will need a FIP and lots of comms to users of the storage market so that we don't break tooling.

ribasushi · 2021-09-24T16:58:07Z

Aye, going with []byte also normalizes to a single type all our custom-input fields ( DealProposal.Label and Message.Params )

laudiacay · 2021-09-24T17:39:00Z

ok cool got a PR ready to go/passing CI for this

* changing the deal label to bytes not strings * removing a todo comment for something that was actually a to-done * wrote migration.... how do i test?? * fixing tests, determinism-gen * left some debugging prints in there * fixes some of the code review * fixing one more issue * fixing rest of code review things- all that's left isss testing * half a test * there's a bug in the migration, test is done * fixing code review

laudiacay · 2021-10-14T16:56:14Z

Done in code, merged into next, FIP hopefully coming soon!!

* changing the deal label to bytes not strings * removing a todo comment for something that was actually a to-done * wrote migration.... how do i test?? * fixing tests, determinism-gen * left some debugging prints in there * fixes some of the code review * fixing one more issue * fixing rest of code review things- all that's left isss testing * half a test * there's a bug in the migration, test is done * fixing code review

ZenGround0 · 2022-04-05T18:22:25Z

FIP 0027 is currently WIP. To read more about the approach we are taking please read FIP 0027 There are four major work items to get this implemented for the upcoming v16 network upgrade

Create a label type that serializes as a union of cbor byte string and cbor text string PR ready for review
The same change must be propagated to builtin rust actors. PR ready for final review
Write a migration that checks labels for utf8 invalidity and converts those labels to byte strings. All proposals with changed serializations will have changed cids and the market pending proposals map must be updated in these cases. This was originally done last year in this closed PR but was reverted for integration reasons. We can repurpose this PR without much rewriting. Current PR here
Lotus integration of the new label type. A WIP PR is open here

austinabell mentioned this issue Oct 14, 2020

Switch serde_cbor to fork to allow unsafe unchecked utf8 deserialization ChainSafe/forest#745

Merged

anorth added bug Something isn't working change-state Changes state schema, necessitating major version upgrade/migration discussion labels Oct 18, 2020

anorth added this to the v3 milestone Oct 22, 2020

anorth added the P2 Medium priority, beneficial for network functionality and growth label Oct 22, 2020

ZenGround0 removed this from the 🧡 v3 milestone Oct 28, 2020

ZenGround0 mentioned this issue Apr 8, 2021

Actors v5 Plan #1391

Closed

28 tasks

ZenGround0 added this to the QoL Improvements milestone Aug 12, 2021

ZenGround0 assigned laudiacay Sep 23, 2021

laudiacay mentioned this issue Sep 24, 2021

fix #1248 changing the deal label to bytes not strings #1496

Merged

laudiacay mentioned this issue Sep 29, 2021

FIP Discussion: Change type of Label in DealProposal to be []byte, rather than String filecoin-project/FIPs#187

Closed

laudiacay closed this as completed Oct 14, 2021

ZenGround0 reopened this Apr 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DealProposal's Label field serializes as non-UTF8 #1248

DealProposal's Label field serializes as non-UTF8 #1248

ec2 commented Oct 13, 2020

anorth commented Oct 18, 2020

rvagg commented Oct 19, 2020

ribasushi commented Oct 19, 2020

mikeal commented Oct 19, 2020

anorth commented Oct 22, 2020

austinabell commented Jan 18, 2021

anorth commented Jan 19, 2021

rvagg commented Jan 20, 2021

austinabell commented Jan 20, 2021

ZenGround0 commented Apr 7, 2021

laudiacay commented Sep 24, 2021

ZenGround0 commented Sep 24, 2021 •

edited

Loading

ribasushi commented Sep 24, 2021

laudiacay commented Sep 24, 2021

laudiacay commented Oct 14, 2021

ZenGround0 commented Apr 5, 2022 •

edited

Loading

DealProposal's Label field serializes as non-UTF8 #1248

DealProposal's Label field serializes as non-UTF8 #1248

Comments

ec2 commented Oct 13, 2020

anorth commented Oct 18, 2020

rvagg commented Oct 19, 2020

ribasushi commented Oct 19, 2020

mikeal commented Oct 19, 2020

anorth commented Oct 22, 2020

austinabell commented Jan 18, 2021

anorth commented Jan 19, 2021

rvagg commented Jan 20, 2021

austinabell commented Jan 20, 2021

ZenGround0 commented Apr 7, 2021

laudiacay commented Sep 24, 2021

ZenGround0 commented Sep 24, 2021 • edited Loading

ribasushi commented Sep 24, 2021

laudiacay commented Sep 24, 2021

laudiacay commented Oct 14, 2021

ZenGround0 commented Apr 5, 2022 • edited Loading

ZenGround0 commented Sep 24, 2021 •

edited

Loading

ZenGround0 commented Apr 5, 2022 •

edited

Loading