
Still not sound: Gradual typing vs. opportunistic decoding #141

Closed
nomeata opened this issue Nov 22, 2020 · 37 comments · Fixed by #168

Comments

@nomeata (Collaborator) commented Nov 22, 2020

Looks like we still haven’t found the grail yet…

TL;DR: Our relaxed coercion rules for opt (#110 and following) may lead to “hard” deserialization errors later on.

It seems that the following can’t all hold:

  • We are sound in the higher-order case (as per IDL-Soundness)
  • We have no subtyping checks during deserialization (one of our design goals).
  • Type Erasure (another of the design goals)
  • Record Extensibility, at least in the form implemented right now.

Counter-example

Here is the counter example:

service A1 : { bar : () -> () }
service B1 : { baz : () -> () } 

service C : { 
  init : (service A1, service B1) -> (); // stores (a1, b1)
  run : () -> () ;  // runs a1.bar(b1.baz())
}

If we pass A1 and B1 to C.init, all is well. C.run works fine.

Now we upgrade A1 to A2 and B1 to B2, as follows:

type WantsBool = service { foo : (bool) -> () }
type WantsInt = service { foo : (int) -> () }
service wantsInt : WantsInt
service A2 : {  bar : (opt WantsBool) -> () } // calls `arg.foo(true)` if `arg` is not `null`
service B2 : { baz : () -> (opt WantsInt) }  // returns (opt wantsInt)

These upgrades are permitted by our <:, and actually desirable (i.e. restricting <: to avoid “obviously bogus” upgrades doesn’t help)

But now C.run will cause B2 to pass opt wantsInt to A2. This will succeed at deserialization (because “Function and service references coerce unconditionally”). Now A2 invokes wantsInt.foo(true), and the deserialization of the (true : bool) argument at expected type (int) will fail in wantsInt.

Generic Soundness proof

In https://github.com/dfinity/candid/blob/master/spec/IDL-Soundness.md#proof-for-canonical-subtyping we have a generic soundness proof for “Canonical subtyping”, that says “A solution with canonical subtyping (transitive, contravariant) is sound.” And our <: is that!

But it’s not “decompositional” any more. In particular,

If t1 <: t2 and s1 in t1 <: s2 in t2 then s1 <: s2.

(as required in the generic proof) doesn’t hold.

Concretely, in our example above we have opt WantsInt <: opt WantsBool (because all opt types are related). But because the reference value is passed through, we also get WantsInt in opt WantsInt <: WantsBool in opt WantsBool, yet we don’t have WantsInt <: WantsBool.

So…

Dunno. No concrete suggestions. It’s Sunday morning, and I just wanted to get this insight out of my head.

@rossberg (Contributor) commented Nov 23, 2020

Hm, let me see.

Before C itself gets changed to the extended signature types and upgraded, it will keep ignoring the new optional args & results and not pass anything on to bar.

However, if you do try to change C to use the new types, you will probably get something like a warning that the call bar(baz()) uses unsafe subtyping. In Motoko, you wouldn't even be able to compile it anymore, and would be forced to change the program to pass null explicitly for the new parameter. (The example is a bit odd in that it composes two calls over a unit value, which is a bit pointless; but you can of course enrich the example such that there is at least one preexisting result/arg involved).

So unless I am missing something, this kind of case can only break if the general opt subtyping was allowed in the source language itself -- which would always be unsound regardless of Candid, unless subtyping was coercive and coerced to null in that case. Which would also fix the problem.

Am I missing something? Is there a way to construct this example with a sound source language? Maybe by going one order higher? (Edit: Don't think that's possible, due to the property that we do not have "covert channels", so never pass on unknown values.)

@nomeata (Collaborator, Author) commented Nov 23, 2020

Before C itself gets changed to the extended signature types and upgraded, it will keep ignoring the new optional args & results and not pass anything on to bar.

Oh, right, I was simplifying this too much. (I think I have made the same mistake many times now when building counter examples…)

I need to make it more higher order:

Counter example (2nd try)

Let me try to decompose this into two steps:

  1. An “unsafe coercion” setup
  2. Exploiting that

For 1., let me try to construct a situation where a service s : service_typ1 turns into an s : service_typ2.

Preparation:

service A1 : { return_s : () -> () }
service B1 : { accept_s : (func () -> ()) -> () } 
service C : { 
  init : (service A1, service B1) -> (); // stores (a1, b1)
  run : () -> () ;  // runs b1.accept_s(a1.return_s)
}

If we pass A1 and B1 to C.init, all is well. C.run and thus B1.accept_s work fine.

Now we upgrade A1 to A2 and B1 to B2, as follows:

service A2 : { return_s : () -> (opt service_typ1)} // return (opt s)
service B2 : { accept_s : (func () -> (opt service_typ2)) -> ()}  // calls arg.(), if result is not null, stores it as (s : service_typ2)

These upgrades are permitted by our <: relation.

But now C.run will cause A2 to pass s : service_typ1 to B2. This will succeed at deserialization (because “Function and service references coerce unconditionally”).

Now B2 has (s : service_typ2), which is clearly unsound.

Actually exhibiting the crash is now easy, just set

service_typ1 = service { foo : (bool) -> () }
service_typ2 = service { foo : (int) -> () }

and let B2.accept_s call s.foo(3), which breaks s.foo (as that expects a bool).

@nomeata (Collaborator, Author) commented Nov 23, 2020

(Edit: Don't think that's possible, due to the property that we do not have "covert channels", so never pass on unknown values.)

We never pass on unknown values, but we can pass on values at “accidental types”. We said that at opt we are fine returning a non-null value if the values happen to work out, and don’t mind if the sender had a different type there (e.g. extra variants, the type of an empty list, or – crucially – different types on references).

@rossberg (Contributor):

Right. Our opt rule relies on detecting any eventual inconsistencies and falling back to null. But if functions are involved, we do not detect any inconsistency. And when an actual call is performed, we have already forgotten the assumed type, so a deferred check a la gradual typing isn't possible either.

It seems like the only fix here would be to go the proper gradual typing route and wrap deserialised function values -- or equivalently, remember their deserialisation type and use that for calls as well as for consecutive serialisations. And yes, do a subtype check upon deserialisation if we want eager detection.
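
For illustration, a rough Rust sketch of the “remember the deserialisation type” variant (all names here are hypothetical and this is not an existing API): the decoded reference carries the function type it was decoded at, and later calls are checked against that remembered type rather than against the importer’s static type.

#[derive(Clone, Debug, PartialEq)]
enum Typ {
    Bool,
    Int,
    Func { args: Vec<Typ>, rets: Vec<Typ> },
    // other Candid types elided
}

#[derive(Clone, Debug)]
struct FuncRef {
    canister: Vec<u8>, // callee principal, raw bytes
    method: String,
}

// a decoded function reference together with the type it was decoded at;
// a consecutive re-serialisation would also use `decoded_at`, not the
// importer's static type
#[derive(Clone, Debug)]
struct TypedFuncRef {
    reference: FuncRef,
    decoded_at: Typ,
}

impl TypedFuncRef {
    // Before performing a call, check the argument types we are about to send
    // against the remembered type (a real version would use subtyping rather
    // than equality). Failing here is the deferred, gradual-typing-style
    // error, raised at the caller instead of deep inside some other canister.
    fn check_call(&self, arg_typs: &[Typ]) -> Result<(), String> {
        match &self.decoded_at {
            Typ::Func { args, .. } if args.as_slice() == arg_typs => Ok(()),
            other => Err(format!(
                "arguments {:?} do not match remembered type {:?}",
                arg_typs, other
            )),
        }
    }
}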

@nomeata (Collaborator, Author) commented Nov 23, 2020

It seems like the only fix here would be to go the proper gradual typing route and wrap deserialised function values

I don’t think wrapping (i.e. eta-expanding and inserting code) works, as we can’t transfer closures, only the raw function or service reference, but

– or equivalently, remember their deserialisation type and use that for calls as well as for consecutive serialisations.

might help. But:

And yes, do a subtype check upon deserialisation if we want eager detection.

Is that really optional? If we don't do that check in the decoding of opt service_typ2 in B2.accept_s (where things go wrong), then even if B2 remembers the type, what do we do with that information?

It seems that the eager subtype check is a requirement to fix this problem, and if we have the subtype check, we don't actually need any wrapping/type passing. Because the subtyping check is enough to ensure that

If t1 <: t2 and s1 in t1 <: s2 in t2 then s1 <: s2.

and then the generic soundness proof applies again.

@rossberg (Contributor):

Well, if we were to go the gradual route then there wouldn't be any check at deserialisation time, but it would decompose into checks at use. Of course, it is too late to fall back to null at that point, so instead you'd get deferred type errors with some notion of blame (which arguably is of limited use in a distributed setting). That would be a possible semantics, but one can argue that it doesn't fit our approach.

So yeah, the choice seems to be between

  1. Eager error detection involving subtype checks.
  2. Lazy error detection involving gradual typing and blame.
  3. No soundness.

All else being equal, (1) would seem like the obvious choice. But the price would be implementing subtyping in deserialisers.

@nomeata (Collaborator, Author) commented Nov 23, 2020

But the price would be implementing subtyping in deserialisers.

I am warming up to that idea… it’s not like it’s impossible.

The main complexity I see here is to implement the recursion breaker correctly, likely by memoization as we do in type.ml all over the place. The data structure, in general, seems to be quadratic in size (a subset of (types in message × types in expected type), or maybe just carefully chosen loop breakers). In the Motoko RTS we are at least better equipped now to implement that (we have the “type hash”, and we can implement that part in C or Rust) than half a year ago. Still not an exciting prospect, though.
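
To make the shape of that concrete, here is a rough Rust sketch of such a check over two decoded type tables (the type representation and all names are made up for illustration; the func/service cases and several of the real rules are elided):

use std::collections::HashSet;

// Hypothetical flattened type tables, as decoded from a message: composite
// types refer to other table entries by index. This is not the spec's or any
// existing implementation's representation.
#[derive(Debug)]
enum Ty {
    Bool,
    Nat,
    Int,
    Opt(usize),
    Vec(usize),
    Record(Vec<(u32, usize)>), // (field hash, type table index)
    // func/service types (contravariant in their arguments) elided
}

// Check sender[i] <: expected[j]. Recursion through cyclic type tables is
// broken by assuming any pair that is already on the current path (the usual
// coinductive reading). A production version would additionally memoize
// settled results per pair -- that is the quadratic table discussed above --
// which needs some care so failed sub-derivations don't pollute the cache.
fn subtype(
    sender: &[Ty],
    expected: &[Ty],
    i: usize,
    j: usize,
    path: &mut HashSet<(usize, usize)>,
) -> bool {
    if !path.insert((i, j)) {
        return true; // already checking this pair: assume it holds
    }
    let ok = match (&sender[i], &expected[j]) {
        (Ty::Bool, Ty::Bool) | (Ty::Nat, Ty::Nat) | (Ty::Int, Ty::Int) => true,
        (Ty::Nat, Ty::Int) => true, // nat <: int
        (Ty::Opt(a), Ty::Opt(b)) => subtype(sender, expected, *a, *b, path),
        (Ty::Vec(a), Ty::Vec(b)) => subtype(sender, expected, *a, *b, path),
        (Ty::Record(fs), Ty::Record(gs)) => {
            // every expected field must be present at a subtype
            // (the real rules also allow absent fields of opt/reserved type)
            let mut all_fields_ok = true;
            for (h, b) in gs {
                let mut found = false;
                for (h2, a) in fs {
                    if h2 == h && subtype(sender, expected, *a, *b, path) {
                        found = true;
                        break;
                    }
                }
                if !found {
                    all_fields_ok = false;
                    break;
                }
            }
            all_fields_ok
        }
        _ => false,
    };
    path.remove(&(i, j));
    ok
}

In the worst case a memo table on top of this ranges over all pairs of entries from the two tables, which is where the quadratic size mentioned above comes from.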

@rossberg (Contributor):

You cannot avoid the quadratic worst-case with structural subtyping. That's not necessarily a problem for normal programs, which won't ever exercise this, but might be a concern with malicious parties sending "type checking bombs" (though I believe even that is only possible if the receiver uses a non-minimal type definition itself).

My other worry is that the extra complexity of implementing such a subtyping check might be seen as a con against using Candid.

@crusso (Contributor) commented Nov 23, 2020

So what if we had a separate opt type, say maybe <typ>, that restricted contents to first-order data so reference types cannot be encountered while deserializing maybe <typ> values? Would that be enough of a restriction to recover soundness?

I'd much rather we ship something less expressive and sound than more expressive and unsound, but I guess you are all with me on that, right?

@rossberg (Contributor):

@crusso, interesting idea to restrict the offensive subtyping to first-order! I don't even think you'd need a separate type for that (which would be harder to target). Simply restrict all "inverse" and "unsafe" rules to only apply with first-order constituent types, which should cover the vast majority of use cases. @nomeata, what do you think?

@crusso (Contributor) commented Nov 23, 2020

Another suggestion might be to make the maybe <typ>, not first-order, but invariant, so as long as subtyping holds, there's nothing to verify. It means you wouldn't be able to reserve or deprecate with maybe reserved and maybe empty (or whatever way around it is).

@nomeata (Collaborator, Author) commented Nov 23, 2020

Simply restrict all "inverse" and "unsafe" rules to only apply with first-order constituent types, which should cover the vast majority of use cases. @nomeata, what do you think?

This might be too restrictive. Recall that we want both of these:

{ foo : opt (service s1) } <: {} -- normal subtyping
{} <: { foo : opt (service s2) } -- reverse subtyping, for extensible argument records

And hence, by transitivity, we get the problematic { foo : opt (service s1) } <: {} <: { foo : opt (service s2) }.

So you are saying we don’t get {} <: { foo : opt (service s2) }?

In other words, you can add new optional arguments to your services, but not if they would take function or service references?

but invariant

I don’t think that works; see the example above. You need variance (in fact the “anything goes” rule) to keep the relation transitive.

@rossberg (Contributor) commented Nov 23, 2020

Yes, such a restriction would prevent you from (contravariantly) adding higher-order fields, even when optional. I was thinking that that may be okay? Such a case might be rare enough that it is fine to require a more heavyweight solution for it. But obviously, I'm hand-waving with that reasoning.

@nomeata (Collaborator, Author) commented Nov 23, 2020

Such a case might be rare enough that it is fine to require a more heavyweight solution for it.

Such as using the untyped principal and doing unsafe casting (which I expect is how people will work around this ;-))

Still not easy, because of transitivity. It sounds like we want

{} <: { foo : opt (variant { first_order : int }) }

but unless we restrict the totally harmless rule t <: t' ⟹ opt t <: opt t' we also have

{ foo : opt (variant { first_order : int }) } <: { foo : opt (variant { first_order : int; higher_order : service s }) }

but now by transitivity we have

{} <: { foo : opt (variant { first_order : int; higher_order : service s }) }

and it's broken again.

@rossberg (Contributor):

Yes, good point. In a certain twisted way I find it reassuring that hacks like that won't work. :)

@nomeata (Collaborator, Author) commented Nov 23, 2020

In a way I find it reassuring that certain twisted hacks like that won't work!

@nomeata (Collaborator, Author) commented Nov 23, 2020

We could forbid function types under opt. That would be safe, I think… but very harsh.

@crusso (Contributor) commented Nov 23, 2020

Or use a different opt type for this, which is what I was suggesting with maybe, retaining opt as it was. maybe types are restricted to first-order content, opt types are unrestricted.

@chenyan-dfinity (Contributor) commented Nov 23, 2020

Even without the opt rules, service types are relying on runtime checks, e.g., we don’t know if a service really contains a method bar. So this is not worse than what we had before?

Following the runtime check idea, we can fetch the current interface during deserialization, and only check for subtyping in the first-order case?

@nomeata (Collaborator, Author) commented Nov 23, 2020

Even without the opt rules, service types are relying on runtime checks, e.g., we don’t know if a service really contains a method bar. So this is not worse than what we had before?

The promise of Candid is that, as long as you only ever upgrade along <:, and only introduce new service references at their currently true type, then you won’t get any deserialization errors. This is the essence of “IDL-Soundness”, and it held before introducing the opportunistic opt decoding.

Following the runtime check idea, we can fetch the current interface during deserialization, and we can check for a lightweight subtyping?

If we are willing to check for subtyping anyways, we can just apply that to compare the type in the type description with the expected type. That’d be good enough to solve the problem at hand. Fetching the actual interface at decoding doesn’t sound appealing: decoding a message should not require async calls (in canisters) or a network lookup (outside), I hope.

@nomeata (Collaborator, Author) commented Nov 30, 2020

Ok, so what options (heh) do we have:

  • Do nothing. System works well in practice. Some calls in specific, unlikely situations will cause decoding errors. Can’t prove nice theorems.
  • Do the full subtyping check on decoding. Makes it less likely that others will implement Candid well (although with a comprehensive and accessible test suite we can probably offset that somewhat). A bit more expensive at runtime. We can prove some form of soundness (although it only holds under strong assumptions, namely that all upgrades are according to the rules).
  • Don't allow function types under opt.
  • Introduce a new type for the weird subtyping rules; don't allow functions there.

Anything else?

@rossberg (Contributor):

I think that about sums it up.

(although it only holds under strong assumptions, namely that all upgrades are according to the rules)

Isn't that always the case?

@nomeata (Collaborator, Author) commented Nov 30, 2020

Right, but one might argue that adding “the unlikely situations that cause decoding errors don’t actually occur” to the list of assumptions isn’t too bad. Not being able to prove theorems is probably worse :-)

nomeata added a commit that referenced this issue Dec 4, 2020
Based on our experience, we can really use some formal treatment of our Candid work.

This branch contains some initial work. Part of this commit:

 * A simple Coq setup (using the more modern dune-based setup)
 * A nix-setup to build this
 * Simple CI integration
 * A Coq-ification of the definitions in IDL-Soundness.md
 * Mechanizing the “canonical subtyping is sound” proof in that document

More work is pending on #141, with ongoing experiments in #147.

@nomeata (Collaborator, Author) commented Jan 19, 2021

@rossberg, so what direction should we head here? Can we get around doing the subtyping check upon decoding?

nomeata added a commit that referenced this issue Jan 19, 2021
This contains an initial formalization of Candid in Coq. It covers these types:
`nat`, `int`, `null`, `opt t`, `empty`, `reserved`

It considers two different formalisms:

* `NoOpportunisticDecoding`
  This is without `t <: opt t'`, but with `t <: opt t`.
  Things go through, although the `opt t <: opt t'` rule has to be tweaked to keep subtyping transitive (see #146).

* `OpportunisticDecoding`
  This is with `t <: opt t'`.
  Because of the negative hypotheses in the coercion relation, this can’t be defined as a simple inductive relation (which is a serious smell!).
  So instead I define it as a function (which is mostly straightforward), and prove the properties there.
  We _could_ take this as an indication that maybe our spec should also just define it via equations. We don’t gain anything from the relational presentation, I think, and implementors all implement functions anyway.
  Nevertheless, I did prove that the relation defined this way admits the intro and induction rules that _would_ come out of the inductive relation (ignoring the negative hypotheses). This revealed some issues with the way the rules are presented; I will create a PR for that soon.

In both cases, I prove
 * Correctness of decoding
 * Roundtripping
 * Uniqueness of decoding
 * Soundness of subtyping
 * Transitivity of subtyping

I do _not_ prove Higher-order soundness, because we know it does not hold (see #141).

This also uses some proof-of-concept “named Coq cases“ feature, see https://www.joachim-breitner.de/blog/777-Named_goals_in_Coq.

@rossberg (Contributor):

It seems like you warmed up to that solution... I'm still on the fence about the extra complication, but I'm willing to try.

Would this actually be conservative over the status quo? I.e., if we later decide to drop the check again, it does not break anything that works in the presence of the check?

@nomeata (Collaborator, Author) commented Jan 20, 2021

Would this actually be conservative over the status quo? I.e., if we later decide to drop the check again, it does not break anything that works in the presence of the check?

Well, dropping the check will bring back the issue that we are trying to fix by introducing it…

The effect of introducing the check is to turn a decode failure later at some other canister into a decode failure earlier. And this earlier failure can be safely caught via the opt decoding error handling…

I am not really warmed up. It’s shooting with a cannon at a relatively obscure corner case. But I don’t see an alternative at this point, short of just giving up on any formal notion of soundness. But then, why even bother?

@rossberg (Contributor):

The effect of introducing the check is to turn a decode failure later at some other canister into a decode failure earlier. And this earlier failure can be safely caught via the opt decoding error handling…

Yes, that's what I'm getting at: removing the check would only delay or eliminate failures. It would not change the behaviour of non-failing executions. At least I assume that is the case.

That means that we can safely introduce it and have the emergency option of falling back to the status quo later if it turns out not to be practical (or nobody implements it). Certainly not what I'd want, but at least avoiding a sort of dead end.

I am not really warmed up. It’s shooting with a cannon at a relatively obscure corner case. But I don’t see an alternative at this point, short of just giving up on any formal notion of soundness. But then, why even bother?

Yes, agreed.

The only sound alternative I see would be proper gradual typing, but then every language runtime would have to push that through for function and actor references, which seems way more unrealistic.

@nomeata (Collaborator, Author) commented Jan 20, 2021

It would not change the behaviour of non-failing executions. At least I assume that is the case.

It would! Because decoding under opt turns failure into success (at value null), the fix turns a forbidding crash later into a gracefully handled type mismatch (but not a decoding failure!) earlier.

@rossberg (Contributor):

Ah, you're right. Oh well. Maybe it's just not worth over-theorising possible futures at this point...

@nomeata (Collaborator, Author) commented Jan 20, 2021

Maybe it's just not worth over-theorising possible futures at this point...

Probably not, but what does that mean? Introduce the subtyping check, or leave things as they are?

@rossberg (Contributor):

Sorry, I mean going forward with introducing it, if there are no objections.

@nomeata (Collaborator, Author) commented Jan 20, 2021

No, just faint hopes that maybe you’ll have a great idea about how we get around doing that. It is going to be a PITA to implement that…

But it means we do the subtyping check on the opt argument, right?

Right now, the value opt (variant { foo } : variant { foo; bar }) decodes as opt variant { foo } at expected type opt variant { foo }.

If we do subtyping checks, we do them not just when decoding reference types, but rather when decoding an opt, and opt (variant { foo } : variant { foo; bar }) decodes as null at expected type opt variant { foo }?

@rossberg (Contributor) commented Jan 20, 2021

No, just faint hopes that maybe you’ll have a great idea about how we get around doing that.

I wish...

If we do subtyping checks, we do them not just when decoding reference types, but rather when decoding an opt, and opt (variant { foo } : variant { foo; bar }) decodes as null at expected type opt variant { foo }?

Interesting question. That was what I was assuming, but I suppose we could only do the check on reference types. That would imply that the first-order fragment of the IDL does still not need such checks. Maybe that's an attractive property? But of course the soundness property becomes more wishy-washy then. To be honest, I'm a bit fuzzy on the implications right now. Have you thought that alternative through?

@nomeata (Collaborator, Author) commented Jan 20, 2021

Not completely through, but I think doing the check on opt is more advisable:

  • In cases where the static check fails but the dynamic one succeeds, we get some “accidental” decoding, i.e. some non-null value comes out despite the sender being on a different interface evolution path. That is more likely to be hurtful than helpful.
  • If we only do it on reference types, which are hardly used (at the moment), the code paths will hardly be exercised and remain buggy, and independent implementors are more likely to take a shortcut there and not implement it.
  • The decoding function doesn’t need any backtracking any more: either it is a subtype (and decoding will succeed), or it is not (and we return null and skip the value). That simplifies the spec, the proofs and the implementation a bit.

Although, maybe this change will save us from having to worry about independent implementors…
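
To illustrate the last point, a tiny self-contained Rust toy (emphatically not the real Candid wire format: no type tables, no LEB128, made-up encoding and names) where decoding at an opt either passes the eager check and decodes, or skips the value and yields null, with no backtracking:

#[derive(Clone, Debug, PartialEq)]
enum ToyTy {
    Bool,
    Int8,
    Opt(Box<ToyTy>),
}

#[derive(Debug, PartialEq)]
enum ToyVal {
    Bool(bool),
    Int8(i8),
    Null,
    Some(Box<ToyVal>),
}

fn subtype(t: &ToyTy, u: &ToyTy) -> bool {
    match (t, u) {
        (ToyTy::Bool, ToyTy::Bool) | (ToyTy::Int8, ToyTy::Int8) => true,
        (ToyTy::Opt(a), ToyTy::Opt(b)) => subtype(a, b),
        _ => false,
    }
}

// consume one byte of the toy payload
fn read_byte(bytes: &mut &[u8]) -> u8 {
    let cur = *bytes;
    *bytes = &cur[1..];
    cur[0]
}

// consume a value of type t without producing anything
fn skip(t: &ToyTy, bytes: &mut &[u8]) {
    match t {
        ToyTy::Bool | ToyTy::Int8 => {
            read_byte(bytes);
        }
        ToyTy::Opt(inner) => {
            if read_byte(bytes) == 1 {
                skip(inner, bytes);
            }
        }
    }
}

fn decode(wire: &ToyTy, expected: &ToyTy, bytes: &mut &[u8]) -> ToyVal {
    match (wire, expected) {
        (ToyTy::Bool, ToyTy::Bool) => ToyVal::Bool(read_byte(bytes) != 0),
        (ToyTy::Int8, ToyTy::Int8) => ToyVal::Int8(read_byte(bytes) as i8),
        (ToyTy::Opt(w), ToyTy::Opt(e)) => {
            if read_byte(bytes) == 0 {
                ToyVal::Null
            } else if subtype(w, e) {
                // the eager check passed, so decoding the payload cannot fail
                // and there is nothing to backtrack over
                ToyVal::Some(Box::new(decode(w, e, bytes)))
            } else {
                skip(w, bytes); // consume the payload, fall back to null
                ToyVal::Null
            }
        }
        _ => panic!("toy fragment: wire and expected type must match outside opt"),
    }
}

fn main() {
    // wire value opt (42 : int8), decoded at expected type opt bool:
    // the eager check int8 </: bool fires, so the result is null instead of
    // a decoding error somewhere inside the value
    let wire = ToyTy::Opt(Box::new(ToyTy::Int8));
    let expected = ToyTy::Opt(Box::new(ToyTy::Bool));
    let mut bytes: &[u8] = &[1, 42]; // tag 1 = "some", payload byte 42
    assert_eq!(decode(&wire, &expected, &mut bytes), ToyVal::Null);
}

With the current relaxed rule, the same input would be decoded speculatively and rolled back to null on failure; here the decision is made once, up front.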

@rossberg (Contributor):

Yeah, avoiding the backtracking certainly is a big plus, both semantically and implementation-wise.

As for implementors, maybe we should offer a generic implementation in C, that other languages/runtimes can use?

@nomeata (Collaborator, Author) commented Jan 20, 2021

Let’s refine the precise formalism in #168

As for implementors, maybe we should offer a generic implementation in C, that other languages/runtimes can use?

I guess that might be something to think about. Other languages would then probably have to turn their “expected” type into something the generic implementation can work with, but it could actually be just a type description as we use in the message format (generating that is probably needed anyways).

I was hoping that you could pre-process the given and expected types and easily mark all the opt t in the output type that will fail, but because of sharing in the graph representation of the output type, this is not so easy…

@nomeata (Collaborator, Author) commented Jan 20, 2021

Maybe a function that takes two type tables (as they occur in the messages), and returns a list of “type table paths”, pointing to all nodes in the output type’s graph (well, tree) where an opt type should always become null… but I am not sure if that list can be empty :-)

Ah, or maybe better: a set of pairs (input type table index, output type table index), where (i1, i2) is in that set if i2 is the index of an opt t in the output type, and i1 is the index of a type opt t' or t' which may need to be decoded at opt t, but with not (t' <: t). (Essentially the interesting entries from the memo table that the obvious algorithm would have to build up)

That should give a decoder (like ours) sufficient information, while containing all of the hard work. Still ugly.
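
Sticking with the hypothetical Ty/subtype sketch from further up the thread, such a helper could look roughly like this (a crude over-approximation: it pairs every sender entry with every opt in the expected table, rather than only the combinations reachable from the method type):

// Given the sender's and the receiver's (expected) type table, collect all
// pairs (i1, i2) where expected[i2] is some `opt t` and a value whose wire
// type is sender[i1] (or `opt` of it) would have to fall back to null there.
fn null_pairs(sender: &[Ty], expected: &[Ty]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    for (i2, e) in expected.iter().enumerate() {
        if let Ty::Opt(t) = e {
            for i1 in 0..sender.len() {
                // if the sender side is itself an opt, compare its payload
                let payload = match &sender[i1] {
                    Ty::Opt(p) => *p,
                    _ => i1,
                };
                let mut path = std::collections::HashSet::new();
                if !subtype(sender, expected, payload, *t, &mut path) {
                    out.push((i1, i2));
                }
            }
        }
    }
    out
}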
