Deadlocks caused by openssl's handling of TLS 1.3 session tickets #819
Oh, this also causes a mess with closing connections. The test that fails is the one at lines 882 to 894 in 5a03bee.
So this involves a somewhat delicate thing: if you open a TLS connection and then immediately close it again, how should that be handled? Specifically, here we have the client calling `aclose` right after the handshake.

In HTTPS-compatible mode it is straightforward, because the client just closes the transport, and the server interprets that as EOF. But in standards-compliant mode, we're only supposed to report EOF if we got a proper cryptographic notification of the client's intention to close the connection. So in this case, the client's `aclose` does send a close_notify before closing the transport, and you'd expect the server to read it and see a clean EOF.

Except... with TLS 1.3, after the handshake, the server tries to send the session tickets. And the standard semantics for transport streams (as determined by e.g. TCP/BSD sockets) are that if your peer has closed the connection, then you can keep reading data until you get everything they sent before they closed it, but if you send data then you get an immediate error and lose all the data that you might have received. Which is what happens here: the client has closed the transport, so the server's attempt to send the session tickets raises an exception, and the client's clean shutdown notification gets lost, so the server never has a chance to see it.

I... don't think there's anything we can actually do here? I think the server is simply stuck getting a broken-stream error instead of a clean EOF. I guess we could swap the roles in the test, so the server's the one who closes it immediately – that would at least let us continue to exercise the aclose-triggering-handshake logic.
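To make the failing sequence concrete, here's a rough schematic of the two sides. The `client_stream`/`server_stream` names are made-up stand-ins for a connected pair of SSLStreams in standards-compliant mode, not the actual test code:

```python
# Schematic only: assumes two connected SSLStream objects in
# standards-compliant mode; names are illustrative.
async def client_side(client_stream):
    await client_stream.do_handshake()
    # aclose() sends close_notify, then closes the transport immediately.
    await client_stream.aclose()

async def server_side(server_stream):
    # Completing the handshake makes openssl queue session tickets, so trio
    # tries to send them -- but the client's transport may already be gone,
    # so that send fails and the close_notify queued behind it is never read.
    await server_stream.do_handshake()
    await server_stream.receive_some(1)  # never sees a clean EOF
```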
Ugh, it's not just that one sequence I analyzed above that triggers this, it's a whole bunch of them that trigger it non-deterministically, probably depending on whether or not the scheduler lets the client close the connection before the server manages to enqueue the session tickets.
Boiling this down further: the major issue with the auto-sending session tickets is that they cannot work reliably in any usage where the server never sends data to the client. As a special case, that includes our toy testcases that just do things like establish a connection and then close it immediately, but it's much more general than that. Also, it doesn't matter whether you use standards-compliant mode or HTTPS-compliant mode. Here are two openssl issues where people ran into these problems before me:
As noted in the second issue, we could potentially avoid the close-related problems if the client always did a full round-trip shutdown, because waiting for the server's close_notify would automatically consume the session tickets. Currently Trio never does a full round-trip shutdown: if we're in standards-compliant mode then we send a close_notify and then immediately close the connection, and if we're in HTTPS-compatible mode then we skip the close_notify entirely. I suppose we could detect when we're using TLS 1.3 and switch to attempting the full round-trip close_notify dance, and ignore errors as appropriate. This seems suboptimal though; upgrading to TLS 1.3 isn't supposed to add round-trips. And anyway, when we're a server we can't control what clients connecting to us do! And clients are specifically allowed by the standard to skip doing a full round-trip shutdown.

The PR attempts to implement a different solution: it ignores EPIPE/ECONNRESET if those occur while attempting to send session tickets. But this has two problems AFAICT. First, we use memory BIOs and then do the I/O ourselves, so any EPIPE/ECONNRESET occurs later, after the openssl call has already returned – openssl never even sees the error (see the sketch just below for where it actually surfaces in our setup).

But that doesn't really matter, because there's another problem that's much more serious. Even if you ignore EPIPE/ECONNRESET while sending the session tickets, that doesn't guarantee you can go back to calling `SSL_read` and still receive whatever the peer sent before closing – at the transport level, the failed send may already have destroyed that data.

...Ugh, I guess I'm filing a bug on openssl.
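Here's the memory-BIO point as a sketch. It's illustrative only: `outgoing` is the `ssl.MemoryBIO` openssl writes into, and `transport` is the underlying trio stream – both names are made up, not trio's actual internals:

```python
import trio

# Sketch: with memory BIOs, openssl's call has already returned by the time
# we touch the network, so a fix that swallows EPIPE inside openssl's own
# write path never runs for us.
async def flush_outgoing(outgoing, transport):
    data = outgoing.read()  # the SSL_do_handshake/SSL_write call already "succeeded"
    if data:
        # Any EPIPE/ECONNRESET surfaces here, as a trio.BrokenResourceError
        # in *our* code -- openssl never sees it.
        await transport.send_all(data)
```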
See python-trio#817

We don't have CI for this yet, but at least on my laptop it helps. However, it doesn't fix everything.

- Add OP_ENABLE_MIDDLEBOX_COMPAT to the list of static ssl constants
- Handle SSLZeroReturnError in more places
- Fix some -- but not all -- of the places where session tickets cause failures. See python-trio#819
It's currently a mess because of how openssl v1.1.1 handles session tickets. There isn't consensus yet about whether this is an openssl bug or what: python-trio#819 openssl/openssl#7948 openssl/openssl#7967
There's a lot more information about the issue in openssl/openssl#7948 and openssl/openssl#7967. At this point I'm pretty sure that the right answer is that the server should wait, and send the session tickets the first time it has some other reason to send data (e.g. because of a call to `SSL_write`).

I guess the most elegant workaround would be to detect when openssl tries to send session tickets at the wrong time (= immediately after the handshake), and then instead of sending them we could stash them away to send later, when the right time arrives (= the next time some other operation requires sending data). The stashing part is pretty straightforward. The tricky part is reliably recognizing the session tickets, because openssl doesn't distinguish them from the rest of the handshake – it just gives you a bunch of opaque bytes to send back and forth, and you're not supposed to think too hard about which bytes represent which messages. But here we need to break that abstraction.

My first thought was: maybe we could peek at the raw bytes to find some kind of message type tag. Unfortunately, it turns out that TLS 1.3 starts encrypting everything mid-way through the handshake. In particular, the session tickets, and the messages that come before them, are both encrypted. Unless we somehow extract the secrets from openssl, all we have is a length field. (And we don't have any way to extract the secrets.)

However, I think there actually is a solution. Looking at RFC 8446, we see in section 4.6.1:

> At any time after the server has received the client Finished message, it MAY send a NewSessionTicket message.
So openssl can't send tickets until after it sees the client's Finished message. And, the client Finished message always comes after the server Finished message – you can see this by examining Figures 1 through 4 that show all the possible handshake patterns, and also if I'm reading sections 4.4.1 and 4.4.4 correctly, the client Finished contains a hash of the server Finished. So when we loop calling `SSL_do_handshake` on the server side, by the time it tells us the handshake is complete, the client Finished has arrived – which means the client has already received everything the server needed to send for the handshake itself.
But! This means that we can actually reliably identify session tickets! If:

1. we're acting as the server,
2. `SSL_do_handshake` has just reported that the handshake is complete,
3. we're speaking TLS 1.3, and
4. there's data sitting in the outgoing buffer,
...then the only thing that can possibly be in the outgoing buffer at this point is session tickets. The server's handshake finished in step 2, so what else could there be? AFAICT, this heuristic is currently 100% reliable.

Should we worry about things changing in the future in a way that breaks this heuristic? In principle, TLS 1.3 allows the server to start sending application data before receiving the client Finished message (you can see this in the handshake diagrams), so you might think application data could end up in the outgoing buffer here too. But in practice it seems nobody actually does this.
And David Benjamin says "I don't think anyone bothers". OpenSSL does have an API for this – in TLS 1.3, "early data" is a concept that only makes sense for clients, so they decided that if a server calls their "early data" functions, then that's how you write application data before seeing the client Finished. But this involves a completely different API that replaces the ordinary `SSL_write` path, so it can't sneak into our outgoing buffer by accident.

Of course at some point we'll want to add support for early data (#223), so we'll need to revisit this then. But for now the stdlib `ssl` module doesn't expose the early-data API at all, so we don't have to worry about it.

Is there anything else that could put data into the outgoing buffer at this point?
So...... unless I'm missing something, this is both reliable and safe, so we should probably do it!
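Concretely, the stashing could look something like this. It's a minimal sketch, not trio's actual code: `ssl_object` is the stdlib `ssl.SSLObject`, `outgoing` is the `ssl.MemoryBIO` it writes into, and `handshake_just_completed` stands in for however the I/O loop knows that the call that just returned was the one that finished the handshake:

```python
import ssl

class TicketStasher:
    """Sketch: hold back bytes that the heuristic says are session tickets."""

    def __init__(self, ssl_object: ssl.SSLObject, outgoing: ssl.MemoryBIO):
        self._ssl_object = ssl_object
        self._outgoing = outgoing
        self._stashed = b""

    def bytes_to_send(self, handshake_just_completed: bool) -> bytes:
        data = self._outgoing.read()
        if (
            handshake_just_completed
            and self._ssl_object.server_side
            and self._ssl_object.version() == "TLSv1.3"
            and data
        ):
            # Per the argument above, the only thing that can be sitting in
            # the outgoing buffer at this exact moment is session tickets,
            # so stash them instead of forcing a send right now.
            self._stashed += data
            return b""
        if self._stashed:
            # First time we have some other reason to send: let the stashed
            # tickets ride along with the real traffic.
            data, self._stashed = self._stashed + data, b""
        return data
```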
Oh, one more thing that's kind of obvious but probably worth saying explicitly: the heuristic above is also robust enough to work correctly if we're using a version of openssl where this bug is already fixed (like boringssl). In this case, at step 4 there simply won't be any session tickets in the outgoing buffer, so our stashing logic will stash 0 bytes, and everything works fine.
David Benjamin was kind enough to read over my post above, and says: "Your state listing is missing HelloRetryRequest, but I think the heuristic works."
As a small coda to this whole saga: it turns out that just like I predicted, this openssl issue has actually been causing data loss in real world situations, with protocols that use unidirectional connections like FTPS and syslog, e.g.:
Which sucks, but at least it's nice to know that the effort we put into this was dealing with a real problem. (Also if openssl keeps dragging their feet we might eventually be forced to add some kind of workaround on the client side to deal with broken servers, but I guess we can wait until someone hits this in real life.)
We have a test that tries running `SSLStream` over an unbuffered transport, in part to confirm we aren't making assumptions about buffering. It turns out that openssl 1.1.1/TLS 1.3 does make assumptions about buffering, and the test hangs. Now what?

Specifically: we create an unbuffered transport, then call `do_handshake` on both the client and server sides. The `client.do_handshake()` completes successfully. The `server.do_handshake()` hangs.

I believe what's happening is that when openssl 1.1.1 is doing TLS 1.3, the server automatically sends some session tickets after the handshake completes. Technically these aren't part of the handshake, so the `client.do_handshake()` call doesn't wait for them. But on the server side we don't know this; all we can see is that we called `SSL_do_handshake` and it spit out some data to send, so we assume that the handshake won't complete until we send that data.

I guess there are two scenarios that the designers imagined: (1) the client is trying to read application-level data, so it's reading from the transport, and it gets the session tickets as a bonus. (2) the client isn't trying to read application-level data, so it's not reading from the transport, but the transport has enough intrinsic buffering that the session tickets can just sit there forever without causing any problems. But in our test neither is true, so it breaks.
As far as I know, this shouldn't have too much real-world impact, since real transports almost always have some buffering. (Empirically the session tickets appear to be 510 bytes, which is smaller than pretty much any real transport buffer.) And, there is a workaround available: modify `clogged_stream_maker` so the server does `await server.do_handshake(); await server.send_all(b"x")`, and the client does `await client.do_handshake(); await client.receive_some(1)`.
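In test terms, the workaround would look roughly like this (the helper and stream names are illustrative, not the real fixtures):

```python
import trio

# Sketch of the workaround: after both handshakes, push one byte of
# application data server -> client so the session tickets get flushed
# along with it. `server_stream`/`client_stream` are assumed to be a
# connected pair of SSLStream objects from the clogged-transport fixture.
async def handshake_and_drain_tickets(server_stream, client_stream):
    async def server_side():
        await server_stream.do_handshake()
        await server_stream.send_all(b"x")   # drags the tickets along

    async def client_side():
        await client_stream.do_handshake()
        assert await client_stream.receive_some(1) == b"x"

    async with trio.open_nursery() as nursery:
        nursery.start_soon(server_side)
        nursery.start_soon(client_side)
```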
Sending a bit of application-level data from server→client like this should flush out the session tickets and get back to the state we want for the rest of the tests.

Nonetheless, it's not ideal that openssl is making subtle assumptions about buffering like this. Should we do something?
Unfortunately, TLS 1.3's decision to move session tickets out of the handshake doesn't give us a lot of options here. We could disable session tickets, but that seems sub-optimal (and currently the `ssl` module doesn't expose a knob for this anyway).

We could potentially delay sending the session tickets until the first time we send application-level data. That way `server.do_handshake()` wouldn't need to make any assumptions about buffering, and it's totally reasonable that `server.send_all(...)` has to wait for the peer to call `client.receive_some(...)`. But there are two challenges:

1. I'm not sure how to implement this. Specifically, it requires figuring out that the data openssl has dumped into its send buffer is the session tickets, as opposed to part of the handshake. Is there any reliable way to do this?
2. There are some cases where session tickets are successfully delivered now, but wouldn't be if we implemented this. Specifically, if you have a connection where the server never sends any application-level data, but the client does read from the network. This sounds a little weird, but it could happen: e.g. any Twisted or asyncio client will act like this; or, if you're using trio, you could intentionally implement a client like this exactly because you want to get session tickets.
So.... I'm not having any brilliant ideas here. I think for right now we'll just implement the workaround to get the test suite working, and then wait to see if this becomes an issue or not.
CC: @tiran – I don't think there's anything for you to do here in particular, but it seems like the kind of issue that you like to know about.