
test t/cycle is racey and fails periodically #50

Open
SpamapS opened this issue Nov 30, 2016 · 11 comments
@SpamapS
Member

SpamapS commented Nov 30, 2016

This is one of the causes of spurious retries on our CI tests. I've been doing some analysis, and it was relatively easy to get the problem to repeat. After 14 tries, this happened:

tried 13 times
cycle.kill.kill					2:000140750 [ ok ]
cycle.worker.single startup/shutdown					0:001611553 [ ok ]
cycle.server_startup().server_startup(1)					1:001624276 [ ok ]
cycle.server_startup().server_startup(many)					20:047278546 [ ok ]
cycle.server_startup().shutdown_and_remove()					0:000085085 [ ok ]
cycle.server_startup().server_startup(many)					20:049186806 [ ok ]
cycle.server_startup().server_startup() with bind() conflict					0:000000629 [ ok ]
tried 14 times
cycle.kill.kill					2:000180426 [ ok ]
cycle.worker.single startup/shutdown					0:003570175 [ ok ]
cycle.server_startup().server_startup(1)					1:003090167 [ ok ]
libtest/client.cc:268: in start() pid(8858) localhost:54428 ping(libtest/client.cc:268: Connection refused), additionally pid: 9031 is alive: true waited: 17 server started. exec: /home/clint/src/gearman/gearmand/libtool --mode=execute /home/clint/src/gearman/gearmand/./gearmand/gearmand --verbose=INFO --log-file=var/log/gearmand.logNt7BSZ --pid-file=var/run/gearmand.pidLHPdDC --port=54428 --listen=localhost  stderr:

tests/cycle.cc:113: in server_startup_multiple_TEST() pid(8858) Assertion '__server_startup_TEST((cycle_context_st*)obj, 20)' != 'TEST_SUCCESS'
cycle.server_startup().server_startup(many)					[ failed ]
cycle.server_startup().shutdown_and_remove()					0:016349708 [ ok ]
cycle.server_startup().server_startup(many)					20:037825710 [ ok ]
cycle.server_startup().server_startup() with bind() conflict					0:000000453 [ ok ]

The mentioned log file shows this:

   INFO 2016-11-29 23:49:39.253396 [  main ] Initializing Gear on port 54428 with SSL: false
   INFO 2016-11-29 23:49:39.000000 [  main ] Starting up with pid 9031, verbose is set to INFO
  ERROR 2016-11-29 23:50:00.000000 [  main ] Timeout occurred when calling bind() for 127.0.0.1:54428 -> libgearman-server/gearmand.cc:688
   INFO 2016-11-29 23:50:00.000000 [  main ] Shutdown complete

Normal runs show this:

   INFO 2016-11-29 23:45:36.616470 [  main ] Initializing Gear on port 22942 with SSL: false
   INFO 2016-11-29 23:45:36.000000 [  main ] Starting up with pid 5764, verbose is set to INFO
   INFO 2016-11-29 23:45:36.000000 [  main ] Listening on 127.0.0.1:22942 (13)
   INFO 2016-11-29 23:45:36.000000 [  main ] Adding event for listening socket (13)
   INFO 2016-11-29 23:45:36.000000 [  main ] Accepted connection from 127.0.0.1:49006
   INFO 2016-11-29 23:45:36.000000 [     4 ] Peer connection has called close()
   INFO 2016-11-29 23:45:36.000000 [     4 ] Disconnected 127.0.0.1:49006
   INFO 2016-11-29 23:45:36.000000 [     4 ] Gear connection disconnected: -:-
   INFO 2016-11-29 23:45:43.000000 [  main ] Clearing event for listening socket (13)
   INFO 2016-11-29 23:45:43.000000 [  main ] Closing listening socket (13)
   INFO 2016-11-29 23:45:43.000000 [  main ] Shutdown complete

My guess is that the function that detects free ports doesn't hang on to that port, and so other things happening on the box take that port, causing an error/timeout in binding to it.
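The suspected race can be sketched in a few lines of Python (a hypothetical stand-in for libtest's get_free_port, which is C++; the function name here is illustrative). The key point is that close() releases the port before the server ever binds it:

```python
import socket

def get_free_port():
    # Bind to port 0 so the kernel assigns a free ephemeral port,
    # read the port back, then close the socket. After close()
    # returns, nothing holds the port: any other process on the box
    # can bind it before gearmand does.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()  # the race window opens here
    return port

port = get_free_port()
# Between this point and gearmand's own bind(), the port is up for
# grabs. If anything else binds it first, gearmand gets EADDRINUSE,
# which matches the "Timeout occurred when calling bind()" in the log.
```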

@p-alik
Collaborator

p-alik commented Nov 30, 2016

It looks like some of the simultaneously running tests occasionally try to start gearmand on the same port.
As mentioned in #52, the test never fails with make -j1 test.

@esabol
Member

esabol commented Nov 30, 2016

Could we run the other tests with -jN and then, after they've finished, run this test with -j1 (or vice versa)?

@SpamapS
Member Author

SpamapS commented Nov 30, 2016

It's not simultaneous running that is causing the problem. I have experienced the problem just running t/cycle like this:

$ tries=0 ; while t/cycle ; do let tries=$tries+1 ; echo "Tried $tries times" ; date ; done

@p-alik
Collaborator

p-alik commented Dec 1, 2016

Indeed, the while-loop tests fail sometimes, and "Address already in use" is the cause. Is get_free_port to blame for it?

But some of my test runs get stuck entirely, and I have no clue what the reason is.

@SpamapS
Member Author

SpamapS commented Dec 1, 2016

They get stuck forever because of #51 unfortunately. I think the kill signal gets sent at a time when SIGTERM is ignored for some reason.

get_free_port isn't to blame directly; it's just the fact that it finds random ports and then doesn't hold them. So there is a race between when it closes the bind() that it does, and when gearmand binds to the port.

I actually think the right thing to do in this case is to have gearmand find the random port by binding to 0 and logging whatever the OS gave it, and then grabbing the port out of the log file.
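A minimal sketch of that proposal (illustrative Python, not actual gearmand code): the server itself binds port 0, the kernel picks a free port, and the server keeps the listening socket open while reporting the port. There is no window in which the port is free but unclaimed.

```python
import socket

# The server binds port 0 and never releases the socket, so the
# kernel-assigned port can't be stolen. The test harness would read
# the port from the server's log output, as proposed above.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(8)
port = listener.getsockname()[1]
print(f"Listening on 127.0.0.1:{port}")  # line the harness would parse
```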

Another method would be to spin the gearmands up in containers where they can all just bind to 4730, and then run the connection test inside the container. But that seems overly complex.

@SpamapS
Member Author

SpamapS commented Dec 2, 2016

I just discovered this. It might be what we need:

https://github.com/google/python_portpicker

@BrianAker
Member

BrianAker commented Dec 2, 2016 via email

@SpamapS
Member Author

SpamapS commented Dec 2, 2016

Yeah it's a real pain. Looking more at python_portpicker, it's basically the same approach as libtest already uses, so the race is still possible.

What I think would work is to just add a --dynamic-port=var/run/foo and then use the same scheme as we use for logfile/pidfile to get a unique port filename.
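That could look something like the following sketch (Python for illustration only; --dynamic-port is the proposal above, not an existing gearmand flag, and the one-port-per-file format is my assumption). The server binds port 0 and publishes the kernel-assigned port to a file, the same way the pid-file works, so the harness never has to guess a port:

```python
import os
import socket
import tempfile

def start_server(port_file):
    # Bind port 0, keep the socket held, and write the assigned port
    # to port_file -- analogous to the existing pid-file scheme.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    s.listen(8)
    with open(port_file, "w") as f:
        f.write(str(s.getsockname()[1]))
    return s

def read_port(port_file):
    # What the test harness would do instead of calling get_free_port.
    with open(port_file) as f:
        return int(f.read())

tmp_dir = tempfile.mkdtemp()
port_path = os.path.join(tmp_dir, "gearmand.port")
server = start_server(port_path)
port = read_port(port_path)
```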

@BrianAker
Member

BrianAker commented Dec 2, 2016 via email

@SpamapS
Member Author

SpamapS commented Dec 2, 2016

Yeah, I opened up issue #24 to deal with the problem of actually using the information we get back from libhostile. Basically.. I need to know the random seed used, or I can't reproduce and actually find the error. I think it might end up in a log file somewhere, so I'm also looking at adding artifact support from Travis (just have to get an S3 account set up that will auto-prune so that Gearman doesn't bankrupt me).

@Sonophoto

@SpamapS Have you seen this project? It is a gcc plugin for tracing shared memory.

https://github.com/blucia0a/CTraps-gcc

Executive Summary: https://github.com/blucia0a/CTraps-gcc/blob/master/README.orig

Use case 1) sounds like it might expose the insight you need for this bug...
