
test t/cycle is racey and fails periodically #50

Open
SpamapS opened this issue Nov 30, 2016 · 11 comments
@SpamapS
Member

SpamapS commented Nov 30, 2016

This is one of the causes of spurious retries on our CI tests. I've been doing some analysis, and it was relatively easy to get the problem to repeat. After 14 tries, this happened:

tried 13 times
cycle.kill.kill					2:000140750 [ ok ]
cycle.worker.single startup/shutdown					0:001611553 [ ok ]
cycle.server_startup().server_startup(1)					1:001624276 [ ok ]
cycle.server_startup().server_startup(many)					20:047278546 [ ok ]
cycle.server_startup().shutdown_and_remove()					0:000085085 [ ok ]
cycle.server_startup().server_startup(many)					20:049186806 [ ok ]
cycle.server_startup().server_startup() with bind() conflict					0:000000629 [ ok ]
tried 14 times
cycle.kill.kill					2:000180426 [ ok ]
cycle.worker.single startup/shutdown					0:003570175 [ ok ]
cycle.server_startup().server_startup(1)					1:003090167 [ ok ]
libtest/client.cc:268: in start() pid(8858) localhost:54428 ping(libtest/client.cc:268: Connection refused), additionally pid: 9031 is alive: true waited: 17 server started. exec: /home/clint/src/gearman/gearmand/libtool --mode=execute /home/clint/src/gearman/gearmand/./gearmand/gearmand --verbose=INFO --log-file=var/log/gearmand.logNt7BSZ --pid-file=var/run/gearmand.pidLHPdDC --port=54428 --listen=localhost  stderr:

tests/cycle.cc:113: in server_startup_multiple_TEST() pid(8858) Assertion '__server_startup_TEST((cycle_context_st*)obj, 20)' != 'TEST_SUCCESS'
cycle.server_startup().server_startup(many)					[ failed ]
cycle.server_startup().shutdown_and_remove()					0:016349708 [ ok ]
cycle.server_startup().server_startup(many)					20:037825710 [ ok ]
cycle.server_startup().server_startup() with bind() conflict					0:000000453 [ ok ]

The mentioned log file shows this:

   INFO 2016-11-29 23:49:39.253396 [  main ] Initializing Gear on port 54428 with SSL: false
   INFO 2016-11-29 23:49:39.000000 [  main ] Starting up with pid 9031, verbose is set to INFO
  ERROR 2016-11-29 23:50:00.000000 [  main ] Timeout occurred when calling bind() for 127.0.0.1:54428 -> libgearman-server/gearmand.cc:688
   INFO 2016-11-29 23:50:00.000000 [  main ] Shutdown complete

Normal runs show this:

   INFO 2016-11-29 23:45:36.616470 [  main ] Initializing Gear on port 22942 with SSL: false
   INFO 2016-11-29 23:45:36.000000 [  main ] Starting up with pid 5764, verbose is set to INFO
   INFO 2016-11-29 23:45:36.000000 [  main ] Listening on 127.0.0.1:22942 (13)
   INFO 2016-11-29 23:45:36.000000 [  main ] Adding event for listening socket (13)
   INFO 2016-11-29 23:45:36.000000 [  main ] Accepted connection from 127.0.0.1:49006
   INFO 2016-11-29 23:45:36.000000 [     4 ] Peer connection has called close()
   INFO 2016-11-29 23:45:36.000000 [     4 ] Disconnected 127.0.0.1:49006
   INFO 2016-11-29 23:45:36.000000 [     4 ] Gear connection disconnected: -:-
   INFO 2016-11-29 23:45:43.000000 [  main ] Clearing event for listening socket (13)
   INFO 2016-11-29 23:45:43.000000 [  main ] Closing listening socket (13)
   INFO 2016-11-29 23:45:43.000000 [  main ] Shutdown complete

My guess is that the function that detects free ports doesn't hang on to that port, and so other things happening on the box take that port, causing an error/timeout in binding to it.
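The suspected race can be sketched in a few lines of Python (a hypothetical stand-in for libtest's get_free_port, which is C++; the function name here is illustrative). The key point is that close() releases the port before the server ever binds it:

```python
import socket

def get_free_port():
    # Bind to port 0 so the kernel assigns a free ephemeral port,
    # read the port back, then close the socket. After close()
    # returns, nothing holds the port: any other process on the box
    # can bind it before gearmand does.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()  # the race window opens here
    return port

port = get_free_port()
# Between this point and gearmand's own bind(), the port is up for
# grabs. If anything else binds it first, gearmand gets EADDRINUSE,
# which matches the "Timeout occurred when calling bind()" in the log.
```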

@p-alik
Collaborator

p-alik commented Nov 30, 2016

It looks like some of the simultaneously running tests occasionally try to start gearmand on the same port.
As mentioned in #52, the test never fails with make -j1 test.

@esabol
Member

esabol commented Nov 30, 2016

Could we run the other tests with -jN and then, after they've finished, run this test with -j1 (or vice versa)?

@SpamapS
Member Author

SpamapS commented Nov 30, 2016

It's not simultaneous running that is causing the problem. I have experienced the problem just running t/cycle like this:

$ tries=0 ; while t/cycle ; do let tries=$tries+1 ; echo "Tried $tries times" ; date ; done

@p-alik
Collaborator

p-alik commented Dec 1, 2016

Indeed, the while-loop tests fail sometimes, and "Address already in use" is the cause. Is get_free_port to blame for it?

But some of my test runs get stuck entirely, and I have no clue what the reason is.

@SpamapS
Member Author

SpamapS commented Dec 1, 2016

They get stuck forever because of #51 unfortunately. I think the kill signal gets sent at a time when SIGTERM is ignored for some reason.

get_free_port isn't to blame directly; it's just the fact that it finds random ports and then doesn't hold them. So there is a race between when it closes the bind() that it does, and when gearmand binds to the port.

I actually think the right thing to do in this case is to have gearmand find the random port by binding to 0 and logging whatever the OS gave it, and then grabbing the port out of the log file.
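A minimal sketch of that proposal (illustrative Python, not actual gearmand code): the server itself binds port 0, the kernel picks a free port, and the server keeps the listening socket open while reporting the port. There is no window in which the port is free but unclaimed.

```python
import socket

# The server binds port 0 and never releases the socket, so the
# kernel-assigned port can't be stolen. The test harness would read
# the port from the server's log output, as proposed above.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(8)
port = listener.getsockname()[1]
print(f"Listening on 127.0.0.1:{port}")  # line the harness would parse
```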

Another method would be to spin the gearmands up in containers where they can all just bind to 4730, and then run the connection test inside the container. But that seems overly complex.

@SpamapS
Member Author

SpamapS commented Dec 2, 2016

I just discovered this. It might be what we need:

https://github.com/google/python_portpicker

@BrianAker
Member

BrianAker commented Dec 2, 2016 via email

@SpamapS
Member Author

SpamapS commented Dec 2, 2016

Yeah it's a real pain. Looking more at python_portpicker, it's basically the same approach as libtest already uses, so the race is still possible.

What I think would work is to just add a --dynamic-port=var/run/foo and then use the same scheme as we use for logfile/pidfile to get a unique port filename.
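That could look something like the following sketch (Python for illustration only; --dynamic-port is the proposal above, not an existing gearmand flag, and the one-port-per-file format is my assumption). The server binds port 0 and publishes the kernel-assigned port to a file, the same way the pid-file works, so the harness never has to guess a port:

```python
import os
import socket
import tempfile

def start_server(port_file):
    # Bind port 0, keep the socket held, and write the assigned port
    # to port_file -- analogous to the existing pid-file scheme.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    s.listen(8)
    with open(port_file, "w") as f:
        f.write(str(s.getsockname()[1]))
    return s

def read_port(port_file):
    # What the test harness would do instead of calling get_free_port.
    with open(port_file) as f:
        return int(f.read())

tmp_dir = tempfile.mkdtemp()
port_path = os.path.join(tmp_dir, "gearmand.port")
server = start_server(port_path)
port = read_port(port_path)
```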

@BrianAker
Member

BrianAker commented Dec 2, 2016 via email

@SpamapS
Member Author

SpamapS commented Dec 2, 2016

Yeah, I opened up issue #24 to deal with the problem of actually using the information we get back from libhostile. Basically.. I need to know the random seed used, or I can't reproduce and actually find the error. I think it might end up in a log file somewhere, so I'm also looking at adding artifact support from Travis (just have to get an S3 account set up that will auto-prune so that Gearman doesn't bankrupt me).

@Sonophoto

@SpamapS Have you seen this project? It is a gcc plugin for tracing shared memory.

https://github.com/blucia0a/CTraps-gcc

Executive Summary: https://github.com/blucia0a/CTraps-gcc/blob/master/README.orig

Use case 1) sounds like it might expose the insight you need for this bug...
