test t/cycle is racey and fails periodically #50
Comments
It looks like some of the simultaneously running tests occasionally try to start |
Could we run the other tests with -jN and then, after they've finished, run this test with -j1 (or vice versa)? |
It's not simultaneous running that is causing the problem. I have experienced the problem just running t/cycle like this:
|
Indeed, tests run in a while loop fail sometimes. "Address already in use" is the cause. Is get_free_port to blame for it? But some of my test runs get stuck entirely, and I have no clue what the reason is. |
They get stuck forever because of #51, unfortunately. I think the kill signal gets sent at a time when SIGTERM is ignored for some reason. get_free_port isn't directly to blame; the issue is that it finds a random port and then doesn't hold it. So there is a race between the moment it closes the socket it bound and the moment gearmand binds to that port. I actually think the right thing to do in this case is to have gearmand find the random port by binding to 0 and logging whatever the OS gave it, and then grabbing the port out of the log file. Another method would be to spin the gearmand instances up in containers where they can all just bind to 4730, and then run the connection test inside the container. But that seems overly complex. |
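For anyone following along, here is a minimal Python sketch of the race described above and of the bind-to-0 alternative. The function names are made up for illustration; this is not libtest's get_free_port or gearmand code, just the same pattern in miniature.

```python
import socket
from typing import Tuple


def find_free_port_racy() -> int:
    """Probe for a free port, then release it (the racy pattern).

    Once the probe socket is closed, nothing reserves the port, so any
    other process can claim it before the real server binds to it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))       # port 0: the OS picks a free port
        return probe.getsockname()[1]      # socket is closed on return


def bind_ephemeral() -> Tuple[socket.socket, int]:
    """Bind to port 0 and keep the socket open, so the port is never released.

    The caller (standing in for the server here) reports the port it was
    given instead of being told which port to use.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen()
    return server, server.getsockname()[1]
```

With the second form there is no window in which the port is unowned; the remaining problem is getting the chosen port number back to the test.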
I just discovered this. It might be what we need: |
So here is the problem: you can pass port 0 to bind(2) and get a free port. The problem is, how do you get this information back to the client?
Now, you can try to release the port and hand it over to another client. The problem, of course, is that if the mini app that was using the port doesn't release it in a sane manner, then you have a problem (there is also a race condition, but there is a trick around this, which may no longer be working).
One solution is that you can have the application write out a file with the port in it!
The next problem is knowing the name of the file that will contain the port. If the server has multiple instances on the same machine, you need to hand the server the name of the file before it starts up (and you will need to guarantee that the name is unique).
— Brian
On Dec 2, 2016, at 10:29 AM, SpamapS ***@***.***> wrote:
I just discovered this. It might be what we need:
https://github.com/google/python_portpicker
|
Yeah it's a real pain. Looking more at python_portpicker, it's basically the same approach as libtest already uses, so the race is still possible. What I think would work is to just add a --dynamic-port=var/run/foo and then use the same scheme as we use for logfile/pidfile to get a unique port filename. |
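A rough Python sketch of the port-file handoff discussed here, to make the idea concrete. It assumes the server is handed a unique file name, binds to port 0, and writes the port it got into that file; the test side polls the file. The function names and the file naming scheme are hypothetical, and --dynamic-port is only the option proposed in this thread, not an existing gearmand flag.

```python
import os
import socket
import tempfile
import time


def publish_port(port_file: str) -> socket.socket:
    """Server side: bind to port 0, then publish the chosen port in port_file."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen()
    port = server.getsockname()[1]
    tmp = port_file + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(f"{port}\n")
    os.replace(tmp, port_file)  # rename so readers never see a partial file
    return server


def wait_for_port(port_file: str, timeout: float = 5.0) -> int:
    """Test side: poll until the server has written its port, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with open(port_file) as fh:
                return int(fh.read().strip())
        except (FileNotFoundError, ValueError):
            time.sleep(0.05)  # not written yet; try again shortly
    raise TimeoutError(f"no port published in {port_file}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:
        # Hypothetical naming scheme: one file per server instance.
        port_file = os.path.join(run_dir, f"gearmand.{os.getpid()}.port")
        server = publish_port(port_file)
        port = wait_for_port(port_file)
        with socket.create_connection(("127.0.0.1", port), timeout=1):
            pass  # connecting proves the published port is the live one
        server.close()
```

Because the server keeps the socket it bound, the port is never free between selection and use; the only coordination needed is the unique file name.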
I would agree about passing the filename for port.
I do not know why, but for some reason I seem to remember a catch-22 with this issue.
At one point in time, Gearman was one of only two apps that could break the UUID generator on Linux. It can run on a massive number of cores with very little lost to locking. It has pounded out a few bugs in the kernel. The libhostile tests are not meant to always succeed, because they can expose bugs in the underlying OS.
It is not a Queue as well!
:)
On Dec 2, 2016, at 12:52 PM, SpamapS ***@***.***> wrote:
Yeah it's a real pain. Looking more at python_portpicker, it's basically the same approach as libtest already uses, so the race is still possible.
What I think would work is to just add a --dynamic-port=var/run/foo and then use the same scheme as we use for logfile/pidfile to get a unique port filename.
|
Yeah, I opened up issue #24 to deal with the problem of actually using the information we get back from libhostile. Basically, I need to know the random seed used, or I can't reproduce and actually find the error. I think it might end up in a log file somewhere, so I'm also looking at adding artifact support from Travis (I just have to get an S3 account set up that will auto-prune, so that Gearman doesn't bankrupt me). |
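To illustrate the seed problem in general terms, here is a small Python sketch of logging the seed at startup and allowing it to be replayed. This is not how libhostile works; the TEST_SEED variable and the seeded_rng helper are invented for the example.

```python
import os
import random
import sys
import time


def seeded_rng(env_var: str = "TEST_SEED", log=sys.stderr):
    """Return (seed, rng); reuse the seed from env_var if it is set."""
    raw = os.environ.get(env_var)
    # Without an override, derive a fresh 32-bit seed from the clock.
    seed = int(raw) if raw is not None else time.time_ns() & 0xFFFFFFFF
    # Always log the seed so that a failing run can be replayed later.
    print(f"random seed: {seed} (set {env_var}={seed} to replay)", file=log)
    return seed, random.Random(seed)
```

Setting the hypothetical TEST_SEED to the logged value makes a later run draw the same random sequence, which is what is needed to reproduce an intermittent failure.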
@SpamapS Have you seen this project? It is a gcc plugin for tracing shared memory. https://github.com/blucia0a/CTraps-gcc Executive Summary: https://github.com/blucia0a/CTraps-gcc/blob/master/README.orig Use case 1) sounds like it might expose the insight you need for this bug... |
This is one of the causes of spurious retries on our CI tests. I've been doing some analysis, and it was relatively easy to get the problem to repeat. After 14 tries, this happened:
The mentioned log file shows this:
Normal runs show this:
My guess is that the function that detects free ports doesn't hang on to that port, and so other things happening on the box take that port, causing an error/timeout in binding to it.