Skip to content
This repository has been archived by the owner on Jun 6, 2023. It is now read-only.

DealProposal's Label field serializes as non-UTF8 #1248

Open
ec2 opened this issue Oct 13, 2020 · 16 comments · Fixed by #1496
Open

DealProposal's Label field serializes as non-UTF8 #1248

ec2 opened this issue Oct 13, 2020 · 16 comments · Fixed by #1496
Assignees
Labels
bug Something isn't working change-state Changes state schema, necessitating major version upgrade/migration discussion P2 Medium priority, beneficial for network functionality and growth

Comments

@ec2
Copy link

ec2 commented Oct 13, 2020

The Label field in DealProposal is a string that is not guaranteed to be UTF8 as required in the CBOR spec. This causes issues for other client implementations (like us). It will take a LOT of effort for us to work around this bug.

I would suggest somehow enforcing UTF-8 or switching the field to be arbitrary bytes instead.

@anorth
Copy link
Member

anorth commented Oct 18, 2020

I haven't investigated deeply, but it sounds like the cbor-gen generated code is permitting non-UTF8 bytestrings in this field.

cc @ZenGround0 @rvagg

@anorth anorth added bug Something isn't working change-state Changes state schema, necessitating major version upgrade/migration discussion labels Oct 18, 2020
@rvagg
Copy link
Member

rvagg commented Oct 19, 2020

I think @willscott might be on the bleeding edge of this one with this work. Also /cc @warpfork @ribasushi as we've collectively been circling around these concerns since Filecoin has been mildly pushing at the definition of "strings" in some situations (thanks mainly to the simple Go mapping of string:[]byte) that introduce some cross-impl difficulties. And /cc @vmx for the Rust -- see linked ChainSafe issue which relates to what you've been working on.

I'll put this on the agenda for our IPLD meeting tomorrow https://github.com/ipld/team-mgmt/#weekly-call although we can't resolve Filecoin concerns there, but @willscott et. al. might have insights that help us suggest a way forward or at least comment further and maybe document some of this at least.

@ribasushi
Copy link
Contributor

Do note that Label is designed to take user-input essentially directly. Adding any type of enforcement in the path populating this field would be... difficult. For comparison the Send params are declared []byte https://github.com/filecoin-project/lotus/blob/v0.10.2/chain/types/message.go#L43

@mikeal
Copy link

mikeal commented Oct 19, 2020

We talked about this in the IPLD team call.

  1. We recommend moving to a byte value instead of a string. As you can see, encoding invalid UTF8 in string values will break in some languages. This is a language cross-compatibility problem for which there is no good solution other than “don’t encode non-UTF8 data in strings when you’re in languages that allow you to have invalid UTF8 in your strings.”

  2. Given that these are already on-chain, there’s a bit of an open question about how to best manage dealing with the existing data. The solution is going to vary quite a bit by language and toolchain. @vmx may have more to say about the best way to deal with this in Rust where we’re currently seeing it. This is less of an issue in Go but once you migrate to a new value type handling the variation in value types is something @warpfork is integrating into go-ipld-prime schemas. JS shouldn’t have any problems, strings are UTF16 and can have invalid characters, and since it’s not strongly typed there’s no concern handling a value type variation in IPLD.

@anorth
Copy link
Member

anorth commented Oct 22, 2020

Ok, representing this as []byte sounds good.

We can't make this change until we do a major-version upgrade whereupon we can change state schema. At that point we can also migrate all deals in the chain state to encode as []byte instead of string. But nothing can change the fact that prior to that migration epoch they will be encoded as strings.

Changing might also be a pain for the markets modules and other software that deals in deals, which will have to handle pre-migration and post-migration values distinctly. We can't do this without buy-in from the teams who will wear that pain on the currently-in-use software. cc @hannahhoward @arajasek @whyrusleeping

@anorth anorth added this to the v3 milestone Oct 22, 2020
@anorth anorth added the P2 Medium priority, beneficial for network functionality and growth label Oct 22, 2020
@ZenGround0 ZenGround0 removed this from the 🧡 v3 milestone Oct 28, 2020
@austinabell
Copy link
Contributor

what is the reason for this being removed from v3 milestone @ZenGround0 ?

@anorth
Copy link
Member

anorth commented Jan 19, 2021

I'm not sure of Zen's explicit reason, but the focus of v3 (and likely v4 too) is chain efficiency, attempting to remove the bottleneck that is the chain validation throughput (i.e. gas usage). It's a rather pressing problem until we can improve throughput to match miner sealing hardware.

@rvagg
Copy link
Member

rvagg commented Jan 20, 2021

To add some more weight for getting this changed - it turns out that it's as difficult in JavaScript as it is in Rust to deal with this. We currently have no way to reliably round-trip these blocks in our JavaScript stack since the standard bytes->string conversion methods available all deal with invalid UTF-8 characters by inserting U+FFFD. We can't really even reliably make use of the data if we decode it straight.

We're talking about strategies for doing a reach-around for special cases like this but there's no clear way to deal with it as the standard case (which is why we've hardened up on the SHOULD for Strings as UTF-8 in the IPLD Data Model https://specs.ipld.io/data-model-layer/data-model.html#kinds-reference). But it would be a special case thing where you know ahead of time that you're dealing with a block that has dodgy strings in it so you'd either explicitly ask to have strings dealt with as bytes, or request an alternative decoding scheme that can get you closer to the bytes you care about, such as latin1.

@austinabell
Copy link
Contributor

I'll just add my motivation for this, aside from the fact that it's invalid Cbor based on the spec (https://tools.ietf.org/html/rfc7049#section-2.1). The main reason is that we can't validate that any string that gets decoded from Cbor is valid utf-8 and can introduce bugs in the future.

I get that the transition and the removal of unsafely decoding strings cannot happen soon now that it's baked into the v0 and v2 actors, but would be nice to remove ASAP so it hopefully won't be a permanent problem in the future. If this gets changed, it allows us to validate the whole protocol does not include this. This was the only occurrence of invalid utf8 when it was introduced, and the value is never being used.

These interactions may not be a big deal for the go implementations since the behaviour might be expected or deterministic, but causes issues for other implementations to match. Also, external tooling or interfacing of this data may cause inconsistencies (for example through JSON because it also only supports unicode)

Having said this, I don't have a problem if this doesn't change and understand why it wouldn't, just wanted to say my piece for why I think the change should be made.

@ZenGround0
Copy link
Contributor

While not critical to do right now I would like to get this scheduled as a fix for v4.

I need some help understanding how extensive a change this is on the lotus side. @magik6k @arajasek could you estimate how much effort will need to go into supporting a change to the on chain deal proposal format? Relatedly are proposal cids used by the markets code extensively?

@ZenGround0 ZenGround0 mentioned this issue Apr 8, 2021
28 tasks
@ZenGround0 ZenGround0 added this to the QoL Improvements milestone Aug 12, 2021
@laudiacay
Copy link
Contributor

So was the decision just slinging around bytes, or erroring if it's non-UTF8?

@ZenGround0
Copy link
Contributor

ZenGround0 commented Sep 24, 2021

The decision is still yet to be made but as long as it is feasible I'd like to filter non-UTF8 strings from the deal label field. Probably as part of deal validation.

The second part of this is remove non-UTF8 deal labels from state entirely. It's definitely easiest to just let the proposals with this property expire, but we could also do a migration to zero out these labels, or do some sort of reduction down to UTF8.

Rereading this issue properly and the consensus to encode as bytes is better than what I just said. This has the nice property of all deal labels being unambiguously convertible in a state migration.

--edit--

The fact that the proposal signature never makes it on chain makes changing the serialization to cbor byte string acceptable imo. But this will need a FIP and lots of comms to users of the storage market so that we don't break tooling.

@ribasushi
Copy link
Contributor

Aye, going with []byte also normalizes to a single type all our custom-input fields ( DealProposal.Label and Message.Params )

@laudiacay
Copy link
Contributor

ok cool got a PR ready to go/passing CI for this

laudiacay added a commit that referenced this issue Oct 14, 2021
* changing the deal label to bytes not strings

* removing a todo comment for something that was actually a to-done

* wrote migration.... how do i test??

* fixing tests, determinism-gen

* left some debugging prints in there

* fixes some of the code review

* fixing one more issue

* fixing rest of code review things- all that's left isss testing

* half a test

* there's a bug in the migration, test is done

* fixing code review
@laudiacay
Copy link
Contributor

Done in code, merged into next, FIP hopefully coming soon!!

ZenGround0 pushed a commit that referenced this issue Dec 31, 2021
* changing the deal label to bytes not strings

* removing a todo comment for something that was actually a to-done

* wrote migration.... how do i test??

* fixing tests, determinism-gen

* left some debugging prints in there

* fixes some of the code review

* fixing one more issue

* fixing rest of code review things- all that's left isss testing

* half a test

* there's a bug in the migration, test is done

* fixing code review
@ZenGround0 ZenGround0 reopened this Apr 5, 2022
@ZenGround0
Copy link
Contributor

ZenGround0 commented Apr 5, 2022

FIP 0027 is currently WIP. To read more about the approach we are taking please read FIP 0027 There are four major work items to get this implemented for the upcoming v16 network upgrade

  • Create a label type that serializes as a union of cbor byte string and cbor text string PR ready for review
  • The same change must be propagated to builtin rust actors. PR ready for final review
  • Write a migration that checks labels for utf8 invalidity and converts those labels to byte strings. All proposals with changed serializations will have changed cids and the market pending proposals map must be updated in these cases. This was originally done last year in this closed PR but was reverted for integration reasons. We can repurpose this PR without much rewriting. Current PR here
  • Lotus integration of the new label type. A WIP PR is open here

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working change-state Changes state schema, necessitating major version upgrade/migration discussion P2 Medium priority, beneficial for network functionality and growth
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants