Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recent changes of ZAP broke czmq zauth #2762

Closed
vyskocilm opened this issue Oct 2, 2017 · 22 comments · Fixed by #2773
Closed

Recent changes of ZAP broke czmq zauth #2762

vyskocilm opened this issue Oct 2, 2017 · 22 comments · Fixed by #2773

Comments

@vyskocilm
Copy link
Contributor

vyskocilm commented Oct 2, 2017

make check on czmq fails

czmq_selftest: src/zauth.c:740: zauth_test: Assertion `!success' failed.

I run the bisect on libzmq code and the result is

a5f94cb610f0a66046a8e749fb08f1e673c8be0e is the first bad commit
commit a5f94cb610f0a66046a8e749fb08f1e673c8be0e
Author: sigiesec <simon.giesecke@btc-ag.com>
Date:   Mon Sep 18 11:48:14 2017 +0200

    Problem: tests without ZAP handler are failing
    
    Solution: emit events as expected by tests, and refuse connections when
    ZAP is required but no handler started
    
    Addresses #2711 partially

:040000 040000 4ab2d84819ec85a23380b72a65bc5e72499677d6 514e53a060fee4319caf28c9cc586016333cfa40 M      src
:040000 040000 77535d58b312f98fa908fc92f5e64fa26add98ba 90f420e8693104eb2f99ca481c11f7066ee0934e M      tests
bisect run success

After reverting that one, czmq tests passed again. However it's not possible to do so on master as there is a queue of identity related changes. cc @bluca, @sigiesec please take a look on a problem, or I'll revert the changes. We'll have workshop this weekend, so need working libzmq master :)

I run bisect with following script

#!/bin/bash

make clean || exit 125
make -j 4 install || exit 125
#make check || exit 125

pushd ../czmq
make clean || exit 125
make -j4 || exit 125
make check || exit 1
popd

exit 0
@bluca
Copy link
Member

bluca commented Oct 2, 2017

Thanks for the analysis - I'll have a look shortly

@sigiesec
Copy link
Member

sigiesec commented Oct 2, 2017

I don't know czmq, and I cannot see from this information which test case failed exactly.
I assume that the tests make assumptions on implementation details that are not true anymore. Before, libzmq allowed using a ZAP handler with an empty domain, in violation of the ZAP specification (see discussion in #2711). This is no longer the case, ZAP authentication is disabled in this case.
I think the czmq tests must probably be adapted.

@bluca
Copy link
Member

bluca commented Oct 2, 2017

It's a bit tricky, as technically it is a public API change :-/

@sigiesec
Copy link
Member

sigiesec commented Oct 2, 2017

Every bug fix is a public API change. Fixing a bug breaks code depending on the presence of the bug.

A different option to fix this, would be to use some "dummy" ZAP domain by default if any of the CURVE keys is set, which must be explicitly cleared if one wishes to disable ZAP authentication.

Maybe this should not be fixed in a 4.2.3, but I think it should be fixed in a 4.3.0. The current master contains many changes that are beyond bug fixes, which IMO should not be released as a 4.2.3 at all. While they may not break existing tests, they change behaviour that is not tested explicitly. I would advise to keep changes in any PATCH version as minimal as possible.

@bluca
Copy link
Member

bluca commented Oct 2, 2017

So this is what the failing test does:

    680     //  Try PLAIN authentication
    681     zsock_set_plain_server (server, 1);
    682     zsock_set_plain_username (client, "admin");
    683     zsock_set_plain_password (client, "Password");
    684     success = s_can_connect (&server, &client, true);
    685     assert (!success);

The helper function binds and connects the server and client, and tries to bounce messages.

Every bug fix is a public API change. Fixing a bug breaks code depending on the presence of the bug.

That's not always the case though - most of the times a bug is just a bug, ie: something does not work. Here the problem is that there are a lot of examples where the previous, wrong, behaviour was used, and that users could have copied. We ourselves often suggest users to go and look at the self tests to see how to use the API.

So I think your suggestion of initialising the domain to a default one is be a pretty good compromise. This way we fix the broken implementation of the protocol, but also we avoid breaking existing applications - if they didn't set the domain before, then it won't matter if it's going to be called something like "zap_default" or something like that.

Maybe this should not be fixed in a 4.2.3, but I think it should be fixed in a 4.3.0. The current master contains many changes that are beyond bug fixes, which IMO should not be released as a 4.2.3 at all. While they may not break existing tests, they change behaviour that is not tested explicitly. I would advise to keep changes in any PATCH version as minimal as possible.

Yes if there are backward compatible changes in behaviour (ie: new APIs move out of draft state) then we'll bump the minor.

@sigiesec
Copy link
Member

sigiesec commented Oct 2, 2017

Ok, I admit it is rather unlikely that someone depends on the presence of a crash for example ;) Still, it is a change in visible behavior.

Besides the fact that the ZAP protocol was violated, the behaviour of the different mechanisms was inconsistent (which may also be viewed as a bug). This inconsistency was deliberately resolved by #2752.

IIRC, before #2752

  • NULL disabled authentication if no ZAP domain was set explicitly (regardless of the presence of a ZAP handler) OR if no ZAP handler was present
  • PLAIN silently rejected all connections if no ZAP handler was present
  • CURVE disabled ZAP authentication if no ZAP handler was present

Now, after #2752

  • all mechanisms use a ZAP handler if and only if a ZAP domain was set, and reject connections if no ZAP handler is present [PLAIN still disallows invalid ZAP configurations (i.e. no ZAP domain set OR no ZAP handler present) but emits a monitor event in that case]

If we use a default ZAP domain for CURVE and PLAIN, this will change to

  • NULL uses a ZAP handler if and only if a ZAP domain was set, and reject connections if no ZAP handler is present
  • PLAIN will fail if no ZAP handler is present
  • CURVE will fail if no ZAP handler is present, unless a the ZAP domain is explicitly set to an empty one

I am not completely convinced that this is really a good idea, as it will partly defeat the attempt to make behaviour consistent across mechanisms.

In #2711, I suggested at some point to introduce an additional socket option that is solely used to activate/deactivate ZAP handling. For 4.2.x, the default could be to resort to the "legacy" behaviour, but for 4.3.x I would rather suggest that the default is that ZAP handling is enabled, but a ZAP domain must be set explicitly.

I admit that this is a kind of two "second-order bugs":

  • the ZAP specification was violated (but this might not matter to a user that supplies his own ZAP handler)
  • the implementation was as specified for each single mechanism (nah, not really in the case of NULL) but the design was inconsistent between mechanisms, and hard to use (it was impossible to enable ZAP authentication for one CURVE socket, and disable it for another)

@wesyoung
Copy link

wesyoung commented Oct 2, 2017

keep it broken- or dummies like me [zyre+curve] won't fix it the right way until it breaks again.. :)

course- defaulting to "global" or something is what i'm about to PR to zyre, so maybe that's not such a bad idea either [less code up high]. speaking from "i'm learning ZAP" this feels to me as a "you'll know when you need domains when you need them and until then you're just gonna use a global default anyway..."

@wesyoung
Copy link

wesyoung commented Oct 2, 2017

also- the way it's re-fixed today- teh czmq tests fail, but in other code i've been testing with (up in python land), the only reason i noticed this, is because i wasn't seeing the zauth messages i thought i would have (all of a sudden, they were there ~13 days ago). so this has the potential to fail silently if you implemented it improperly too...

spent a few hours scratching my head on that one, finally rebuilt to see the czmq build failures with make check...

@bluca
Copy link
Member

bluca commented Oct 2, 2017

This is the required diff for CZMQ:

bluca/czmq@056f433

Usage of PLAIN is the worst offender - the domain was basically never set. There is also one incompatibility for CURVE - before with the ZAP handler available but with no domain set, the connection was rejected, but now it is allowed.

So perhaps a new option as you suggested to keep legacy behaviour as expected for now would be a good solution instead, with a big, fat warning in the NEWS to let users know the default is going to change.

A big problem I see is that there not only applications, but a million and a half bindings as well. Many are well maintained, but some are not, and breaking backward compatiblity means making the latter group useless.

Then when we tag v5.0.0 we can break backward compatibility in this regard and switch the defaults around.
In terms of releng, I would rather wait for a lot of the currently DRAFT stuff to be ready before tagging 5.0 - so we can do a big release with the new socket types, and so on.

What do you guys think?

@vyskocilm
Copy link
Contributor Author

@sigiesec thanks for explanation. It looks like fixable for czmq. The point is that zeromq should release libzmq and czmq with ZAP fixes at the same time. @bluca any thoughts?

@wesyoung
Copy link

wesyoung commented Oct 3, 2017

i agree. breaking something at a patch level seems... awkward at best [may make things worse if zauth just starts ignoring stuff]. i'm poking through zyre to at-least make sure the defaults are set [and configurable]. we may wanna clean up the czmq zauth text a bit:

https://github.com/zeromq/czmq/blob/master/include/zauth.h#L54

that may have been where my brain suggested "i guess you don't need to set the domain" [and then it worked OK and ...]. even though the tests had set the domain a few times.. (may help bindings users with adoption if it's consistent imo).

i guess this might answer the social contract question- which is more important, the API or the RFC.. [and i think the 'API' is probably the right answer, the RFC is something we aspire to be..]. :)

@wesyoung
Copy link

wesyoung commented Oct 3, 2017

@bluca here are the ones' i'm testing if you wanna send your PR first (no hurry, once you get the feedback you feel comfortable with)

https://github.com/zeromq/czmq/compare/master...wesyoung:fix/zap-gossip?expand=1
https://github.com/zeromq/zyre/compare/master...wesyoung:fix/zap?expand=1

they initially appear to work OK, i did a little more re-factoring now that i'm wrapping my head around the gsl generated bits, so they're a little bigger- but figure you can fix the tests and i'll add this.. which should pass OK once we're settled on a direction in this issue.

also- i think i know how to apply the whole CURVE + ZAP thing to the zproto/gsl layer so the defaults are laid properly to new protocols moving forward.

@sigiesec
Copy link
Member

sigiesec commented Oct 5, 2017

@bluca sounds acceptable to me. I might add that option, but I fear it will at least until end of next week until I can work on that again. so if anyone else might implement that, don't hesitate ;) I will be happy to give it a look in terms of a review :)

bluca added a commit to bluca/libzmq that referenced this issue Oct 7, 2017
bluca added a commit to bluca/libzmq that referenced this issue Oct 8, 2017
bluca added a commit to bluca/libzmq that referenced this issue Oct 9, 2017
bluca added a commit to bluca/libzmq that referenced this issue Oct 9, 2017
Solution: add ZMQ_ZAP_DOMAIN_REQUIRED to hide backward incompatible
change and make it disabled by default.
In a future release that breaks API compatibility we can then switch
the default to enabled in order to achieve full RFC compatibility.

Fixes zeromq#2762
@bluca
Copy link
Member

bluca commented Oct 9, 2017

@sigiesec this is my proposal: bluca@0800772

I'll send a PR in a while once the CI is clean. I've tested it with both CZMQ and Malamute, both before and after fixing them to set the domain, and it seems to work in all cases.

@sigiesec
Copy link
Member

sigiesec commented Oct 9, 2017

I gave your proposal a quick look. It's somewhat different from what I had in mind. How would you evolve this for a future 5.0 version? Remove it? Change the default to 1 and disallow setting any other value?

I had a ZMQ_ENABLE_ZAP option in mind that defaults to -1 now, which means legacy behaviour. Other values are 0=disable and 1=enable. In 5.0.0, the -1 value would be removed, and the default change to either 1 or 0. This would also allow to change the default to something else already now via a compile-time option.

@bluca
Copy link
Member

bluca commented Oct 9, 2017

Yes the idea is that in 5.0.0 the default would change to true, and in 6.0.0 we then can remove it completely.

I'm not sure about ZMQ_ENABLE_ZAP - the issue I see is that it could be misleading, as ZAP can already be enabled, just not with 100% spec compliance.
By calling it _REQUIRED, it makes it immediately obvious that it's an additional, stricter enforcement, and it should be simpler to grasp for users. What do you think?

@sigiesec
Copy link
Member

sigiesec commented Oct 9, 2017

Honestly, I think both may be confusing in different ways but to a similar degree.
If clarity for users of the legacy behaviour is a priority, I would rather use a more neutral name such as ZMQ_ZAP_USAGE. Instead of using -1, 0 and 1 where the latter two may be viewed as boolean, some other values might be used.

@wesyoung
Copy link

wesyoung commented Oct 9, 2017

ZMQ_ZAP_ENFORCE ?

@bluca
Copy link
Member

bluca commented Oct 9, 2017

I'll open a PR with ZMQ_ZAP_ENFORCE_DOMAIN and then we can iterate from that after we get feedback, this way we can move on, ok?

bluca added a commit to bluca/libzmq that referenced this issue Oct 9, 2017
Solution: add ZMQ_ZAP_ENFORCE_DOMAIN to hide backward incompatible
change and make it disabled by default.
In a future release that breaks API compatibility we can then switch
the default to enabled in order to achieve full RFC compatibility.

Fixes zeromq#2762
bluca added a commit to bluca/libzmq that referenced this issue Oct 9, 2017
Solution: add ZMQ_ZAP_ENFORCE_DOMAIN to hide backward incompatible
change and make it disabled by default.
In a future release that breaks API compatibility we can then switch
the default to enabled in order to achieve full RFC compatibility.

Fixes zeromq#2762
@sigiesec
Copy link
Member

@bluca Sorry it took some time. Actually, I do not like this, but apparently it is too late. In #2711, we (or at least I ;) ) came to the conclusion that we can save the effort/complexity to add a new socket option to make controlling the activation/use/... of ZAP explicit, but that controlling it explicitly would be more desirable. Now a new socket option is introduced, so we have added complexity, but still the control is not explicit.
IMO this should definitely be changed (again) before a release is made. I hope I find time to do that.

@bluca
Copy link
Member

bluca commented Oct 13, 2017

Sorry about that - but as they say "perfect" is the enemy of "done" and given backward compatibility was affected I wanted to fix it sooner rather than later :-)
Feel free to send a new PR to change it - the new socket option is DRAFT so we can change it or remove as we see fit. As long as the overall default behaviour is backward compatible it will be fine.

@sigiesec
Copy link
Member

@bluca Never mind :) I just opened a new issue so that we/I do not forget about this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants