Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crash in ssl3_read_bytes/ tls_get_message_header #23650

Closed
mteske opened this issue Feb 21, 2024 · 15 comments
Closed

crash in ssl3_read_bytes/ tls_get_message_header #23650

mteske opened this issue Feb 21, 2024 · 15 comments
Labels
resolved: not a bug The issue is not considered a bug triaged: bug The issue/pr is/fixes a bug

Comments

@mteske
Copy link

mteske commented Feb 21, 2024

used version is 3.0.7-24 als provided in RHEL9.
I get this crash from time to time:

(gdb) where
#0  0x00007f26bb0e14df in tls_get_message_header (mt=<synthetic pointer>, s=0x7f2684001940) at ssl/statem/statem_lib.c:1167
#1  read_state_machine (s=0x7f2684001940) at ssl/statem/statem.c:587
#2  state_machine (s=0x7f2684001940, server=<optimized out>) at ssl/statem/statem.c:442
#3  0x00007f26bb0d2969 in ssl3_read_bytes (s=<optimized out>, type=23, recvd_type=0x0, buf=0x7f25b0000d48 "", len=8192, peek=0, readbytes=0x7f25e69f39f0) at ssl/record/rec_layer_s3.c:1711
#4  0x00007f26bb0ab7fc in ssl3_read_internal (s=0x7f2684001940, buf=0x7f25b0000d48, len=8192, peek=0, readbytes=0x7f25e69f39f0) at ssl/s3_lib.c:4462
#5  0x00007f26bb0b2137 in SSL_read (s=<optimized out>, buf=<optimized out>, num=<optimized out>) at ssl/ssl_lib.c:1885

it looks like it does some kind of reset in ssl3_read_bytes() here:

   /*
     * Unexpected handshake message (ClientHello, NewSessionTicket (TLS1.3) or
     * protocol violation)
     */
    if ((s->rlayer.handshake_fragment_len >= 4)
            && !ossl_statem_get_in_handshake(s)) {
        int ined = (s->early_data_state == SSL_EARLY_DATA_READING);

        /* We found handshake data, so we're going back into init */
        ossl_statem_set_in_init(s, 1);

        i = s->handshake_func(s);

but it has been forgotten that init_buf is freed and reset to NULL (added around 2020).

@mteske mteske added the issue: bug report The issue was opened to report a bug label Feb 21, 2024
@mattcaswell mattcaswell added triaged: bug The issue/pr is/fixes a bug and removed issue: bug report The issue was opened to report a bug labels Feb 21, 2024
@mattcaswell
Copy link
Member

Does this happen in a client or a server?

@mteske
Copy link
Author

mteske commented Feb 21, 2024

this is on the client :
(gdb) p s->server
$3 = 0

I'll attach the result of
print *s
this might be interesting.

@mteske
Copy link
Author

mteske commented Feb 21, 2024

s.txt

the output of print *s in gdb.

@mattcaswell
Copy link
Member

If I understand the description of the crash correctly you are experiencing a crash in tls_get_message_header() on this line, because s->init_buf is NULL. Correct?:

    p = (unsigned char *)s->init_buf->data;

From the rest of your description it looks like it is attempting to re-enter "init" to deal with a new handshake message that has arrived after the initial handshake message has completed. It appears we are dealing with TLSv1.3 here (s->version == 772, which is TLSv1.3) - so typically this occurs when a server sends a NewSessionTicket or a KeyUpdate message.

After the initial handshake completes we expect the state machine to be in the "MSG_FLOW_FINISHED" state (i.e. s->statem.state == MSG_FLOW_FINISHED). When we call s->handshake_func(s, 1) from ss3_read_bytes(), this actually enters the state_machine function (ssl/statem/statem.c).

We are then supposed to hit this code:

openssl/ssl/statem/statem.c

Lines 351 to 406 in 19cc035

/* Initialise state machine */
if (st->state == MSG_FLOW_UNINITED
|| st->state == MSG_FLOW_FINISHED) {
if (st->state == MSG_FLOW_UNINITED) {
st->hand_state = TLS_ST_BEFORE;
st->request_state = TLS_ST_BEFORE;
}
s->server = server;
if (cb != NULL) {
if (SSL_IS_FIRST_HANDSHAKE(s) || !SSL_IS_TLS13(s))
cb(s, SSL_CB_HANDSHAKE_START, 1);
}
/*
* Fatal errors in this block don't send an alert because we have
* failed to even initialise properly. Sending an alert is probably
* doomed to failure.
*/
if (SSL_IS_DTLS(s)) {
if ((s->version & 0xff00) != (DTLS1_VERSION & 0xff00) &&
(server || (s->version & 0xff00) != (DTLS1_BAD_VER & 0xff00))) {
SSLfatal(s, SSL_AD_NO_ALERT, ERR_R_INTERNAL_ERROR);
goto end;
}
} else {
if ((s->version >> 8) != SSL3_VERSION_MAJOR) {
SSLfatal(s, SSL_AD_NO_ALERT, ERR_R_INTERNAL_ERROR);
goto end;
}
}
if (!ssl_security(s, SSL_SECOP_VERSION, 0, s->version, NULL)) {
SSLfatal(s, SSL_AD_NO_ALERT, ERR_R_INTERNAL_ERROR);
goto end;
}
if (s->init_buf == NULL) {
if ((buf = BUF_MEM_new()) == NULL) {
SSLfatal(s, SSL_AD_NO_ALERT, ERR_R_INTERNAL_ERROR);
goto end;
}
if (!BUF_MEM_grow(buf, SSL3_RT_MAX_PLAIN_LENGTH)) {
SSLfatal(s, SSL_AD_NO_ALERT, ERR_R_INTERNAL_ERROR);
goto end;
}
s->init_buf = buf;
buf = NULL;
}
if (!ssl3_setup_buffers(s)) {
SSLfatal(s, SSL_AD_NO_ALERT, ERR_R_INTERNAL_ERROR);
goto end;
}
s->init_num = 0;

You can see that, as long as we are in the MSG_FLOW_FINISHED state, then we are supposed to end up with reinitialising the init_buf.

Subsequently, a bit lower down beyond the end of the block above we change the state in MSG_FLOW_WRITING...and the write state machine will realise it has nothing to write, and immediately we move into MSG_FLOW_READING (which is the state we are in when printing out s).

A few of possibilities spring to mind as to how this process has gone wrong:

  • The init_buf gets cleared in 3 main scenarios: SSL_clear() has been called, SSL_free() has been called or the handshake has finished (we are in state MSG_FLOW_FINISHED). So one possibility is that SSL_free() or SSL_clear() has been called erroneously at some point and left us in some weird state.
  • Another possibility is that somehow the init_buf is being freed after it has been re-initialised at the beginning of the statem() function, but before we try and read the message header. Not sure how that could happen.
  • A third possibility, is that somehow we have re-entered the handshake but we've not started in the MSG_FLOW_FINISHED state, so the re-initialisation of the init_buf hasn't kicked in. Again I'm not sure how that could happen.

Not sure if there is any debugging you can do to help figure out which of these scenarios is occurring?

@mteske
Copy link
Author

mteske commented Feb 21, 2024

To the first question: yes. Unfortunately this is very hard to reproduce, it happens around once a week during startup. The server was just started before so maybe it was not completely initialized and sends some garbage in this state. The comment says "unexpected handshake message or protocol violation", it could well be some garbage over the line.
I have a coredump if you need anything else to be printed. Additionally I could try to build openssl myself after adding some debug/logging code (any wishes what to print here?) so that we have more info if this happens again. I'll try to find a way to reproduce it more easily but this could be difficult. Maybe I can provoke more crashes if I restart our system in a cycle.

@mteske
Copy link
Author

mteske commented Feb 21, 2024

good news, it looks like I can reproduce it with a restart loop, at least it happened once again. Let me know what you need to be logged (preferably stderr) or attach a patch and I'll try to reproduce it again. Meanwhile I'l get the source rpms...

@t8m
Copy link
Member

t8m commented Feb 21, 2024

Is the client possibly multithreaded and calling the SSL functions simultaneously from multiple threads?

@t8m
Copy link
Member

t8m commented Feb 21, 2024

It would also help further investigating if you could build your client against a clean latest OpenSSL release tarball instead of the Red Hat build to see if the problem is still reproducible.

@mteske
Copy link
Author

mteske commented Feb 21, 2024

yes it is multithreaded (the omniORB library). It's supposed to be one thread per connection or locked properly, but of course it could be a bug there as well. Maybe it's best if I put in some logging and try to recreate the crash and then either post it here or close the issue :-)

@t8m
Copy link
Member

t8m commented Feb 21, 2024

You could definitely see bugs like this if SSL_*() functions are called simultaneously against a single SSL object from multiple threads without proper call serialization through locking.

@mteske
Copy link
Author

mteske commented Feb 22, 2024

I'm still on it. Recreating a libssl replacement in a redhat system is not that easy, with all their patches and special stuff, but I think I got it finally.
I'll try to find out wrong usage first. If this seems not the case I'll try with the newest libssl, too.

@mteske
Copy link
Author

mteske commented Feb 22, 2024

I can now reproduce it with a self-built lib. In the core I find another thread being in ssl code, but with a different SSL * handle. This would allowed, I assume. I now try to somehow log the calls to the crucial functions in memory so I can see what happens exactly, maybe the error happened a short time before...

@t8m
Copy link
Member

t8m commented Feb 22, 2024

Yeah different SSL * in different threads should work with no problems.

@mteske
Copy link
Author

mteske commented Feb 22, 2024

I might have found an unprotected bit of code where SSL_pending is called. I suppose this could cause such a problem when called in parallel?

@mteske
Copy link
Author

mteske commented Feb 23, 2024

While investigating further, I found even more such problems. Especially when enabling bidirectional mode there seems to be a blocking read which hangs around while a send is done. But now I wonder how this actually worked for 10 days with heavy load :-) But I think it's clear now, that this is not an openssl problem, but a usage problem.

@t8m t8m added the resolved: not a bug The issue is not considered a bug label Feb 25, 2024
@t8m t8m closed this as not planned Won't fix, can't repro, duplicate, stale Feb 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
resolved: not a bug The issue is not considered a bug triaged: bug The issue/pr is/fixes a bug
Projects
None yet
Development

No branches or pull requests

3 participants