Requests to hyper server eventually start failing. #950
Does this problem occur without the use of openssl? |
I'm not sure. I have not yet managed to reproduce the problem in a short amount of time, and running the application without openssl is not possible. EDIT: To give an idea of the timeline between starting the application and it failing: The last time this happened was a week after the server application was started, during which about 90k requests were handled. The incident before that one occurred several days after the application was started. |
I'm possibly seeing the same issue. I haven't spoken up so far because I'm not sure whether it's Hyper-related or caused by my own code, and I haven't had the time to look into it yet. But since this issue is open now, I figured I'd chime in.

If my issue is Hyper-related at all, it definitely only occurs in connection with OpenSSL: HTTPS requests don't get a reply, while HTTP requests are handled promptly. The issue goes away after a restart. This has been happening every 1-2 weeks since early August, and now almost daily since late October. I didn't deploy a new version before the rate started picking up, so it must be related to external factors, too. I don't have traffic statistics at hand, but it can't be much. The page has near zero content and hasn't been announced anywhere, so I'd be surprised if anyone but me and the search engine bots were looking at it.

Whether the issue is caused by Hyper+OpenSSL or my own code is hard to tell without further inspection, as the only real work is done in response to HTTPS requests; an HTTP request will only return a redirect to the equivalent HTTPS URL.

@seanmonstar At the moment this has a very low priority for me, but if you wish, I can take the time to look into it and try to create a reduced test case. To be honest, I kinda hope this will just go away with async Hyper, so I wasn't planning on spending any effort on this, unless I need that website to go live before 0.10 lands. |
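A minimal sketch of the kind of HTTP handler described above, where a plain-HTTP request is only answered with a redirect to its HTTPS equivalent, written against the hyper 0.9-era API. The domain, port, and function name are illustrative assumptions, not the commenter's actual code.

```rust
// Sketch: answer every plain-HTTP request with a redirect to the equivalent
// HTTPS URL. Domain and port are placeholders.
extern crate hyper;

use hyper::header::Location;
use hyper::server::{Request, Response, Server};
use hyper::status::StatusCode;
use hyper::uri::RequestUri;

fn redirect(req: Request, mut res: Response) {
    // keep the request path so the redirect points at the same resource
    let path = match req.uri {
        RequestUri::AbsolutePath(ref p) => p.clone(),
        _ => "/".to_owned(),
    };
    *res.status_mut() = StatusCode::MovedPermanently;
    res.headers_mut()
        .set(Location(format!("https://example.com{}", path)));
    res.send(b"").unwrap();
}

fn main() {
    Server::http("0.0.0.0:80").unwrap().handle(redirect).unwrap();
}
```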
Very interesting. I wonder if the usage of …

I do hope that asynchronous IO helps in this way, and I know that the newer tokio-tls crate that hyper will be using has had much more work done. If this bug is because of hyper's usage of OpenSSL, one way to work around it for now is to put nginx in front and terminate TLS there, before sending requests to your hyper server. |
I can confirm this issue, without using openssl. It appeared after upgrading from 0.9.10 to 0.9.11 and still exists in 0.9.12, even when hyper is built without openssl. |
More about my environment: the server sits behind nginx, and most queries to it are plain HTTP, not HTTPS. |
Downgrading to 0.9.10 didn't help. |
Error message in the nginx log: … 0 errors from the server itself. |
I can say it's not related to the number of requests. With 0.9.10 a reboot doesn't help; with 0.9.12 a reboot helps (although that's not a solution, so I'm on 0.9.9 right now). |
Hm, does turning off keep-alive fix it? You can pass |
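For reference, a sketch of how keep-alive might be switched off on a 0.9-era hyper Server. The exact call is cut off in the comment above, so the keep_alive(Option<Duration>) setter used here is an assumption about that API generation.

```rust
// Sketch only: disable keep-alive before starting a hyper 0.9-style server.
// The keep_alive setter and its Option<Duration> argument are assumed.
extern crate hyper;

use hyper::server::{Request, Response, Server};

fn handler(_: Request, res: Response) {
    res.send(b"ok").unwrap();
}

fn main() {
    let mut server = Server::http("0.0.0.0:3000").unwrap();
    server.keep_alive(None); // assumed setter: None = no keep-alive
    server.handle(handler).unwrap();
}
```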
Switching to 0.9.9 helped - no failures for 5 hours already. I'll try keep_alive on the weekend. |
Nice, now 0.9.9 just failed and won't work even after a restart... Our main site is offline and there's nothing I can do. |
Looks like Iron doesn't give access to the underlying Hyper Server, so keep-alive is not an option. |
@e-oz looks like you can pass a |
@seanmonstar at least it started this time, thanks. I'll report back on how long it keeps working. |
For what it's worth, I disabled keep-alive some months ago hoping that it would help, but it didn't seem to make any difference. But I'm guessing my issue is different from @e-oz's anyway, since I've never experienced that a restart didn't help. |
I've got multiple websites running on 0.9.11 with no issue, but I'm front-ending them with nginx, and nginx is handling all of the TLS and then proxying the content (hyper serves only on 127.0.0.1 so people can't bypass the nginx frontend). Also, my handler sets Connection: close if it is busy (and hyper enforces it), and that fixed the problem of clients tying up all the threads with keep-alives. With this setup, I've had no problems. Just another data point. Also, I'm not using Iron, so there's that. |
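A sketch of the setup described above, assuming the hyper 0.9 API: the server listens on 127.0.0.1 only, so it is reachable solely through the nginx proxy, and the handler asks for the connection to be closed when the process is busy. The is_busy check is a hypothetical stand-in for whatever load signal the real application uses.

```rust
// Sketch: hyper 0.9-style server that is only reachable via a local proxy
// and sets Connection: close when busy, so keep-alive connections cannot
// tie up all the worker threads.
extern crate hyper;

use hyper::header::Connection;
use hyper::server::{Request, Response, Server};

fn is_busy() -> bool {
    // hypothetical placeholder for an application-specific load check
    false
}

fn handle(_: Request, mut res: Response) {
    if is_busy() {
        // ask hyper to close this connection instead of keeping it alive
        res.headers_mut().set(Connection::close());
    }
    res.send(b"hello").unwrap();
}

fn main() {
    // bind to localhost only; nginx terminates TLS and proxies to this port
    Server::http("127.0.0.1:3000").unwrap().handle(handle).unwrap();
}
```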
@seanmonstar it looks like switching keep-alive off solved the issue for me. @mikedilger I use nginx as a proxy too. |
And after 8 days it failed again, at night, without any requests. |
My sincerest apologies - I found the reason for the failures; it was a segfault in another library. I'm sorry for making assumptions without any proof, please excuse me. |
Update from me: I'm going to need that website I'm working on pretty soon now, so my motivation to figure out this issue has increased quite a bit. Unfortunately, the frequency of the freezes has gone down since I saw daily freezes in early November, and I'm only seeing the problem every 1-2 weeks now. That makes it harder to figure out what's going on. Nonetheless, I saw my first freeze today since I pushed a version with more logging. As far as I can tell, all requests that enter my handler also leave it. That indicates the problem originates in Hyper (or below), which is what I suspected anyway. I'm going to keep investigating and will check back as I learn more. |
At the time of my last post, I deployed a new version with the following changes:
That version froze yesterday. This rules out my suspicion/hope that setting the timeouts would solve the problem. After a careful examination of the logs I learned some more things:
My plan now is to add more logging to my Hyper fork and deploy that later today. I'll check back once I learn more. |
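For context, setting read and write timeouts on a 0.9-era hyper Server generally looks something like the sketch below. The setter names are assumed from that API generation, and the durations are arbitrary rather than the configuration actually deployed here.

```rust
// Sketch: configure read/write timeouts on a hyper 0.9-style Server.
// set_read_timeout / set_write_timeout are assumed setter names.
extern crate hyper;

use std::time::Duration;

use hyper::server::{Request, Response, Server};

fn handler(_: Request, res: Response) {
    res.send(b"ok").unwrap();
}

fn main() {
    let mut server = Server::http("0.0.0.0:3000").unwrap();
    server.set_read_timeout(Some(Duration::from_secs(30)));
    server.set_write_timeout(Some(Duration::from_secs(30)));
    server.handle(handler).unwrap();
}
```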
I had another freeze. I think I can confirm that the problem is related to, maybe even caused by, OpenSSL. After the last successful HTTPS request I have in the logs, there's a partial request that stops right before calling …

Unfortunately I made a mistake placing the log messages within my … I've already fixed my broken log messages and am going to deploy that fix shortly. If that substantiates my suspicion, I'll try and upgrade the OpenSSL crate to the latest version. |
@hannobraun you can't use the latest version of the openssl crate with hyper. |
@e-oz I know that Hyper's stuck on 0.7 currently. I plan to upgrade my Hyper fork to the latest version. |
@hannobraun the latest version of openssl is still not perfect; I'm experiencing segfaults with it. I'm not trying to stop you, though :) |
@e-oz Thanks for the info. Ideally, I'd like to fix the source of the problem. If that source happens to be rust-openssl, it makes sense to first check if it still occurs in the latest version. If I can't fix it, or the end result is not satisfactory for another reason, I can still fall back to using nginx for HTTPS. By the way, I'm also seeing segfaults, about once per month. Haven't started tracking those down yet :) |
Thanks for investigating this. I've had a hyper server running at meritbadge.herokuapp.com for more than a year, and it never hangs or segfaults. However, being on heroku means I don't use SSL... |
@seanmonstar No need to thank me. Just doing what I have to, to get that website online. Thank you, for building something that makes it all possible in the first place! |
@hannobraun This is quite a long shot, but I've recently seen surprising behaviour where OpenSSL uses locking callbacks provided externally, and if you link to something that provides them (e.g. Python), then you could accidentally block OpenSSL calls with something like a Python GIL not being released: sfackler/rust-openssl#532. Like I say, a long shot, but if you are linking to Python, it's something to consider. |
@mikedilger Thanks for the info, although I don't think it applies to my case. I'm not aware of anything in my dependencies that would register those callbacks. Certainly no Python in there. Very interesting bug though :-) |
Froze again. All evidence still points to |
And it froze again. My suspicions have been confirmed. The thread is definitely entering …

I've decided to wait for #985 to resolve before testing with the latest OpenSSL. In the meantime, I'm going to try Rustls [1][2]. Depending on how that goes, I might not even go back to OpenSSL afterwards. Let's see. |
Now here's a surprise: 3 days into my Rustls evaluation, the process froze again. Everything looked exactly the same as it did with OpenSSL! I don't know what this means. Maybe the issue is with Hyper after all, maybe it's with the operating system, or maybe OpenSSL and Rustls just happen to have the same bug. No idea. Unfortunately, when I switched to Rustls, I also upgraded to the latest Hyper release, which means I lost all the extra logging from my special Hyper fork. I'll re-add the logging and will report back once I know more. |
@hannobraun I'm curious if setting read and write timeouts before passing the socket to wrap_server would help at all. It depends on whether the hang is waiting on socket IO, or is some sort of Mutex deadlock. |
@seanmonstar Interesting, so the timeouts are set in …

I will test whether setting the timeouts first will solve this problem, but I guess this should be fixed in any case. Do you agree, or am I missing something here? |
It was just something I never thought of, and the code in …

There isn't currently a way for the |
I'll look into it. At the very least I'll hack something together to verify that this causes the problem, but I'll also try to come up with a good solution. Would you want me to create a pull request against |
I'm publishing 0.10.0 right now (fixing docs deployment before publishing to crates.io, but asap!). So to allow it to be a non-breaking change, maybe a new method on the …

Again, I'm not certain this will help, as it may be a deadlock somewhere else. Just thinking that the TLS handshake has both read and write IO, and with a blocking socket, it will just block until the other side responds or hangs up. |
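To make the ordering concern concrete, here is a std-only illustration of the idea, under the assumption that the hang is a blocking TLS handshake waiting on socket IO. tls_wrap_server is a hypothetical stand-in for the real wrap_server call, and the address and durations are arbitrary.

```rust
// Illustration: apply socket timeouts BEFORE the TLS wrap, so that a
// handshake against a silent or half-open peer cannot block forever.
use std::io;
use std::net::{TcpListener, TcpStream};
use std::time::Duration;

// Hypothetical stand-in for a TLS server-side wrap; the real thing would
// perform the handshake, which reads and writes on the socket.
fn tls_wrap_server(stream: TcpStream) -> io::Result<TcpStream> {
    Ok(stream)
}

fn accept_one(listener: &TcpListener) -> io::Result<TcpStream> {
    let (stream, _peer) = listener.accept()?;

    // timeouts are set on the raw socket first, so they also bound the
    // handshake performed inside tls_wrap_server
    stream.set_read_timeout(Some(Duration::from_secs(30)))?;
    stream.set_write_timeout(Some(Duration::from_secs(30)))?;

    tls_wrap_server(stream)
}

fn main() {
    let listener = TcpListener::bind("127.0.0.1:8443").expect("bind failed");
    let _stream = accept_one(&listener).expect("accept failed");
}
```

If the timeouts are only applied to the wrapped stream afterwards, the handshake itself runs on a fully blocking socket, and a client that connects and then goes silent can pin the accepting thread indefinitely.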
This may be unrelated, but I'm curious if the people experiencing this issue are using Debian / OpenSSL? I experienced almost identical symptoms in a Python application I was working on some time ago, where after an inconsistent amount of time (sometimes days, sometimes weeks) the application would simply hang doing nothing, and the problem was essentially impossible to reproduce intentionally -- the time after which the issue occurred did not seem to correlate to the number of requests, so simply bombarding the server with requests would not trigger the bug. The underlying problem was that the Debian repo's version of OpenSSL would sometimes hang indefinitely when being called in Python's |
@ojensen5115 Not me. I'm using Arch Linux with OpenSSL 1.0.2j. It does sound like the same bug, though. On a related note: I've been running the |
As of today, my process has been running for 3 weeks without freezing. This is the longest confirmed uptime since I started tracking this issue in August. While this isn't firm proof, it's a strong sign that #1006 is indeed a fix for this problem. I welcome anyone who can reproduce this issue to update to the latest release and see if it helps. I'll keep my process running for as long as I can to further confirm this has been fixed, but I may have to deploy a new version soon. |
@hannobraun Excellent! Thanks for sticking through it, and reporting back so much. I'm going to close this issue, since it does seem like it is fixed. If it occurs again, we can re-open. |
@seanmonstar You're welcome, and thank you for your help. Glad to be rid of this problem (hopefully). |
I wrote a basic server application using hyper 0.9.10 that prints the IP, referrer, and body of each request to stdout (which I then redirect into a file):
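The original snippet did not survive the formatting here. A minimal sketch of such a server against the hyper 0.9-era API might look like the following; it is illustrative only (plain HTTP, no TLS setup) and not the reporter's actual code.

```rust
// Sketch: log peer address, Referer header, and request body to stdout
// for every request. Port and response body are placeholders.
extern crate hyper;

use std::io::Read;

use hyper::header::Referer;
use hyper::server::{Request, Response, Server};

fn log_request(mut req: Request, res: Response) {
    let referrer = req
        .headers
        .get::<Referer>()
        .map(|r| r.0.clone())
        .unwrap_or_else(|| "-".to_owned());

    let mut body = String::new();
    let _ = req.read_to_string(&mut body);

    // stdout is redirected into a file by the caller
    println!("{} {} {}", req.remote_addr, referrer, body);

    res.send(b"ok").unwrap();
}

fn main() {
    Server::http("0.0.0.0:3000").unwrap().handle(log_request).unwrap();
}
```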
Everything works as expected initially, but some number of hours after starting the server, all requests to it begin to fail (and do so very slowly). When this happens, the results of
time curl -k ...
look like this: … Restarting the server application corrects this issue and requests are handled within 100 ms, tops.
I'm flummoxed.