Thread exhaustion #338
And I get a feeling it might be time-based rather than job-based. (as in the leaks happen over time, rather than per message, as I can process thousands of messages after starting without issue) |
Reviewing further, this might be my issue. It's only the non-Rails app and it only seems to be affected when I'm running it with foreman (which is the strangest part of all this). |
I couldn't find anything specific in the foreman changelogs, but updating to the latest version (I was only one behind) has resolved this issue. |
Thanks @waynerobinson for giving all the details 🍻 I'm happy it's working fine with the latest foreman. Would you mind sending me the foreman version you were using, and the new one? Just in case someone else faces the same issue. |
Yeah, I think I've boiled the problem down to something in one of my apps. Neither seems like Foreman or Shoryuken's fault. My early guess is a connection pooling problem with our DB. It was probably an unwise decision to use the mysql2 gem directly. 😅 |
@waynerobinson hm. Not sure if it's the case, but you need to set your database pool to the same size as your concurrency. |
Yup. It is. And it works for a while. Then something goes out of control and starts spawning a bunch of threads. Sigh. |
It's been a while since the last time I used MySQL, but please let me know if I can help with debugging. |
Thanks. :-) |
Turns out this issue was caused by an uncaught exception in a library making HTTP calls via Faraday. I'm unsure as to why this uncaught exception ended up causing this runaway thread problem though. Shouldn't a thread that ends up throwing an exception just get killed off and move on? I'll see if I can replicate the issue with a simpler setup. |
It should:
that will help 🍻 |
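On the question above about what happens to a thread that raises: in plain Ruby, an unhandled exception terminates only that thread, and the exception is re-raised in whichever thread later calls `join` on it. A minimal stdlib sketch (the method name `run_failing_thread` is hypothetical, and this says nothing about how Shoryuken or Faraday wrap their own threads):

```ruby
# Starts a thread that raises, then joins it and reports what happened.
# Returns [alive_after_join, message_of_exception_seen_by_joiner].
def run_failing_thread
  thread = Thread.new do
    # Silence the default stderr report so the example output stays clean.
    Thread.current.report_on_exception = false
    raise ArgumentError, "boom"
  end

  error = nil
  begin
    thread.join # re-raises the thread's exception in the caller
  rescue ArgumentError => e
    error = e
  end

  [thread.alive?, error && error.message]
end

alive, message = run_failing_thread
puts "alive after join: #{alive}, joiner saw: #{message}"
```

So a single failing thread should not, by itself, multiply threads; something else has to keep creating new ones.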
OK, turns out all my attempted problem-solving hasn't managed to solve this issue. However, I think I have managed to replicate it.
Basically, it seems that when Shoryuken is running against an empty queue it starts to leak threads. Any time the queue is empty, it leaks threads. Whilst it is processing data, it will sometimes stop.
My suspicion is that it has something to do with the heartbeat timer task not getting cleaned up properly when the queue is actually empty and causing a deadlock somewhere, rather than being properly cleaned up. When I set the
I can replicate this on Ruby 2.3.1 with the following
class EchoWorker
  include Shoryuken::Worker
  shoryuken_options(
    queue: "cashdeck-dev-wayne-leaker",
    auto_delete: true,
    body_parser: :json
  )
  def perform(sqs_message, message)
    puts "HI #{Time.now}"
  end
end
With the following
verbose: true
concurrency: 10 # The number of allocated threads to process messages. Default 25
delay: 0 # The delay in seconds to pause a queue when it's empty. Default 0
queues:
  - [cashdeck-dev-wayne-leaker, 1]
I execute via:
Obviously, you'll need to change the queue. This one has a 2 second wait timeout set.
Gemfile.lock
|
Interesting experiment during testing. I increased the Also, the leaking seems to happen outside of the dispatch cycle itself. I added It seems like something is running the task every time and deadlocking and becoming a zombie thread, but the |
Also, removing the Receive Message Wait Time from the queue makes this problem go away altogether. |
There seems to be a bug in the timeout code for ruby-concurrency/concurrent-ruby#526 Basically, the Also,
Each time it schedules an execution it leaks a thread. As for a work-around? I'm not sure. Timeouts in |
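To illustrate the general failure mode described above (this is a stdlib sketch under assumptions, not concurrent-ruby's or Shoryuken's actual code): if a periodic tick starts a new thread each time, and each of those threads blocks for longer than the tick interval (as a long-poll receive on an empty queue would), the threads pile up. The name `leaky_ticker` and the `Queue` used as a blocker are hypothetical stand-ins.

```ruby
# Simulates a ticker that spawns one thread per tick. Each spawned thread
# blocks on `blocker.pop`, standing in for a long-running receive call.
# Returns the list of spawned threads so the caller can inspect them.
def leaky_ticker(ticks:, interval:, blocker:)
  spawned = []
  ticks.times do
    spawned << Thread.new { blocker.pop } # blocks until something is pushed
    sleep interval                        # next tick fires before the last one finishes
  end
  spawned
end

blocker = Queue.new
threads = leaky_ticker(ticks: 5, interval: 0.01, blocker: blocker)
puts "still alive after 5 ticks: #{threads.count(&:alive?)}"

# Release the blocked threads so the example cleans up after itself.
threads.size.times { blocker << :done }
threads.each(&:join)
```

With a zero wait time the blocking call returns immediately, so each thread finishes before the next tick, which would be consistent with the observation that removing the Receive Message Wait Time hides the problem.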
Any thoughts on using http://ruby-concurrency.github.io/concurrent-ruby/Concurrent/Actor.html for the main dispatcher loop? Using TimerTask seems like a bit of a hack (not to mention the implementation is broken). Shouldn't there just be a supervised main loop of requests for new messages? I know The only other option I can think of is rolling this ourselves either with the dispatch loop on the main thread or via a couple of asynchronous methods watching each other. The quick and dirty fix to the above would be to just use |
Great investigation. I need more time to investigate it better. I did an initial implementation with
def processor_done
  dispatch
end
def dispatch
  # return if no-queue or worker available and try again in 1 sec
  if no_processor && no_queue
    dispatch_later
    return
  end
  # after finishing requesting
  # call dispatch again
  dispatch
end
def dispatch_later
  @pool.post do
    sleep 1
    dispatch
  end
  # or even sleep in the main thread
  # sleep 1
  # dispatch
end
WDYT? |
Sorry, didn't mean to imply that a heartbeat is hacky, just the use of a periodically running task to manage it. Ideally you'd have a thread (supervisor) that's responsible for keeping the dispatcher running and restarting it if it crashes. And the dispatcher can just run in a loop.
But given the main loop of Shoryuken is just a dispatch loop anyway, I don't see any specific problem with putting this on the main thread, unless you want to implement a pool of dispatchers at a later date.
I don't think the above code would work in Ruby as it doesn't like deep recursion (unlike Erlang/Elixir) and you'll get a stack level too deep issue. But something like:
def processor_done
  dispatch_loop
end
def dispatch_loop
  loop do
    if no_processor && no_queue
      sleep 1
    else
      dispatch
    end
  end
end
def dispatch
  # Actual dispatch code
end |
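A runnable stdlib sketch of the loop-based idea above, assuming a plain `Queue` as a stand-in for SQS and a hypothetical `LoopDispatcher` class (none of these names come from Shoryuken). It runs the loop in a single dedicated thread, sleeps when there is nothing to do, and never recurses:

```ruby
# Hypothetical dispatcher: one long-lived thread, no timer task, no recursion.
class LoopDispatcher
  def initialize(source, idle_sleep: 0.01, &handler)
    @source = source         # a Queue standing in for the message source
    @idle_sleep = idle_sleep # how long to pause when there is no work
    @handler = handler
    @running = false
  end

  def start
    @running = true
    @thread = Thread.new { dispatch_loop }
    self
  end

  # Asks the loop to finish and waits for the thread to exit.
  def stop
    @running = false
    @thread.join if @thread
  end

  private

  def dispatch_loop
    while @running
      if @source.empty?
        sleep @idle_sleep
      else
        dispatch
      end
    end
  end

  def dispatch
    message = @source.pop(true) # non-blocking pop
    @handler.call(message)
  rescue ThreadError
    # The queue was drained between the empty? check and the pop; try again later.
  end
end

queue = Queue.new
3.times { |i| queue << i }
seen = []
dispatcher = LoopDispatcher.new(queue) { |m| seen << m }.start
sleep 0.01 until queue.empty?
dispatcher.stop
puts "processed: #{seen.inspect}"
```

Because there is exactly one dispatcher thread for the life of the process, an idle queue costs a sleep rather than a new thread. A supervisor that restarts the loop after a crash is left out of this sketch.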
No need for sorry. I got your point 😄 Cool, I will play with it. |
No problem. Also, didn't quite understand what Basically, we just need to find somewhere at the end of booting everything the dispatch loop can start in. Or run it in its own thread somewhere. |
@waynerobinson for how long did you have to keep it running to replicate the issue? |
It happens very quickly on an empty queue. Make sure the Receive Wait Time on the queue is set quite high (like 20 seconds) to make it worse. It increases at hundreds of threads per receive cycle. Add another TimerTask that just prints out This was specifically with MRI, not sure if other rubies have lower thread limits on the default pools, but you should still experience a memory leak as the pool's queue size would grow. |
It basically doesn't occur if there are messages in a queue to process or the Receive Wait Time is set to 0. |
Changed
Shoryuken.sqs_client_receive_message_opts = {
  wait_time_seconds: 20
}
class EchoWorker
  include Shoryuken::Worker
  shoryuken_options(
    queue: "cashdeck-dev-wayne-leaker",
    auto_delete: true,
    body_parser: :json
  )
  def perform(sqs_message, message)
    puts "HI #{Time.now}"
  end
end
task = Concurrent::TimerTask.new(execution_interval: 1) do
  Shoryuken.logger.debug "Threads: #{Thread.list.size}"
end
task.execute
And the output of a few cycles: https://gist.github.com/waynerobinson/20362b034694664bc87395032393a8fd |
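For anyone who wants to check for this kind of leak without pulling in a timer task, a small stdlib sketch that measures how many live threads a block of code leaves behind. The helper name `thread_growth` is hypothetical; it relies only on `Thread.list`, which returns the threads that are currently runnable or stopped.

```ruby
# Returns how many more live threads exist after the block than before it.
def thread_growth
  before = Thread.list.size
  yield
  Thread.list.size - before
end

# Example: deliberately leave three sleeping threads behind.
leaked = []
growth = thread_growth do
  3.times { leaked << Thread.new { sleep } }
end
puts "thread growth: #{growth}"

# Clean up the deliberately leaked threads.
leaked.each(&:kill).each(&:join)
```

Wrapping one receive cycle in a helper like this, and logging the result, should show a steadily positive number if threads are being leaked per cycle.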
@waynerobinson awesome, I could reproduce that. I'm on it. |
Cheers. 👍 |
Stop queuing up heartbeat threads Fix #338
@waynerobinson 3.0.4 is out with your fix! 🍻 https://github.com/phstc/shoryuken/blob/master/CHANGELOG.md |
I'm not sure how to describe this in a replicatable way. :(
But we have a Rails and a non-Rails app that we're using v3 of Shoryuken with.
After processing a number of requests (on a Mac) we are starting to see
Error fetching message: can't create Thread: Resource temporarily unavailable
errors.
Macs only allow a pretty small number of threads per process (I believe it's 2048) and it is usually a pretty hard limit.
However, we only have concurrency set to 10 and no pool size configured for the DB.
I'm not sure how we're exhausting threads. :(