Thread exhaustion #338

Closed

waynerobinson opened this issue Mar 16, 2017 · 27 comments

Comments

@waynerobinson
Contributor

I'm not sure how to describe this in a reproducible way. :(

But we have a Rails and a non-Rails app that we're using v3 of Shoryuken with.

After processing a number of requests (on a Mac) we are starting to see Error fetching message: can't create Thread: Resource temporarily unavailable errors.

Macs only allow a pretty small number of threads per process (I believe it's 2048) and it is usually a pretty hard limit.

However, we only have concurrency set to 10 and no pool size configured for the DB.

I'm not sure how we're exhausting threads. :(

@waynerobinson
Contributor Author

waynerobinson commented Mar 16, 2017

And I get a feeling it might be time-based rather than job-based.

(as in the leaks happen over time, rather than per message, as I can process thousands of messages after starting without issue)

@waynerobinson
Contributor Author

Reviewing further, this might be my issue.

It's only the non-Rails app, and it only seems to be affected when I'm running it with foreman (which is the strangest part of all this).

@waynerobinson
Contributor Author

I couldn't find anything specific in the foreman changelogs, but updating to the latest version (it was only one behind) has resolved this issue.

@phstc
Collaborator

phstc commented Mar 16, 2017

Thanks @waynerobinson for giving all the details 🍻

I'm happy it's working fine with the latest foreman. Would you mind sending me the foreman version you were using, and the new one? Just in case someone else faces the same issue.

@waynerobinson
Contributor Author

Yeah, I think I've boiled the problem down to something in one of my apps. It doesn't seem like either Foreman's or Shoryuken's fault. My early guess is a connection pooling problem with our DB. It was probably an unwise decision to use the mysql2 gem directly. 😅

@phstc
Collaborator

phstc commented Mar 16, 2017

@waynerobinson hm. Not sure if it's the case, but you need to set your database pool to the same size as your concurrency.
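
For example, a minimal sketch of keeping the two settings in sync (this assumes Active Record and a config/database.yml, which the original non-Rails app may not use; the numbers are placeholders):

# shoryuken.yml
concurrency: 10

# config/database.yml
production:
  adapter: mysql2
  pool: 10 # match Shoryuken's concurrency so every processor thread can check out a connection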

@waynerobinson
Contributor Author

Yup. It is. And it works for a while. Then something goes out of control and starts spawning a bunch of threads. Sigh.

@phstc
Collaborator

phstc commented Mar 16, 2017

It's been a while since I last used MySQL, but please let me know if I can help with debugging.

@waynerobinson
Contributor Author

Thanks. :-)

@waynerobinson
Contributor Author

Turns out this issue was caused by an uncaught exception in a library making HTTP calls via Faraday.

I'm unsure as to why this uncaught exception ended up causing this runaway thread problem though. Shouldn't a thread that ends up throwing an exception just get killed off and move on?

I'll see if I can replicate the issue with a simpler setup.

@phstc
Collaborator

phstc commented Mar 17, 2017

Shouldn't a thread that ends up throwing an exception just get killed off and move on?

It should:

Should a thread crash for any reason the thread will immediately be removed from the pool and replaced.
https://ruby-concurrency.github.io/concurrent-ruby/Concurrent/FixedThreadPool.html
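
A minimal standalone sketch of that behaviour (not Shoryuken code): a task that raises doesn't take the pool down, and later work is still processed.

require 'concurrent'

pool = Concurrent::FixedThreadPool.new(2)

pool.post { raise 'boom' }            # the failing task is dropped; the pool keeps going
pool.post { puts 'still processing' } # subsequent tasks still run

pool.shutdown
pool.wait_for_termination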

I'll see if I can replicate the issue with a simpler setup.

that will help 🍻

@waynerobinson reopened this Mar 21, 2017
@waynerobinson
Contributor Author

waynerobinson commented Mar 21, 2017

OK, turns out all my attempted problem-solving hasn't managed to solve this issue.

However, I think I have managed to replicate it.

Basically, it seems that whenever Shoryuken is running against an empty queue, it leaks threads; while it is processing messages, the leaking sometimes stops.

My suspicion is that it has something to do with the heartbeat timer task not getting cleaned up properly when the queue is actually empty, causing a deadlock somewhere.

When I set the HEARTBEAT_INTERVAL to a larger number, it leaks less.

I can replicate this on Ruby 2.3.1 with the following run.rb file:

class EchoWorker
  include Shoryuken::Worker
  shoryuken_options(
    queue:                    "cashdeck-dev-wayne-leaker",
    auto_delete:              true,
    body_parser:              :json
  )

  def perform(sqs_message, message)
    puts "HI #{Time.now}"
  end
end

With the following shoryuken.yml file:

verbose: true
concurrency: 10  # The number of allocated threads to process messages. Default 25
delay: 0        # The delay in seconds to pause a queue when it's empty. Default 0
queues:
  - [cashdeck-dev-wayne-leaker, 1]

I execute via:
bundle exec shoryuken start -r ./run.rb -L shoryuken.log -C shoryuken.yml

Obviously, you'll need to change the queue. This one has a 2 second wait timeout set.

Gemfile.lock

GEM
  remote: https://rubygems.org/
  specs:
    aws-sdk-core (2.8.8)
      aws-sigv4 (~> 1.0)
      jmespath (~> 1.0)
    aws-sigv4 (1.0.0)
    concurrent-ruby (1.0.5)
    jmespath (1.3.1)
    shoryuken (3.0.3)
      aws-sdk-core (~> 2)
      concurrent-ruby
      thor
    thor (0.19.4)

PLATFORMS
  ruby

DEPENDENCIES
  shoryuken

BUNDLED WITH
   1.14.3

@waynerobinson
Contributor Author

Interesting experiment during testing.

I increased the HEARTBEAT_INTERVAL and when it is higher (close to the Receive Message Wait Time, give or take some latency/processing times), the leaking goes away.

Also, the leaking seems to happen outside of the dispatch cycle itself.

I added puts "Dispatching threads=#{Thread.list.size}" to the top of the Shoryuken::Manager#dispatch method, and this number grows at a rate inversely proportional to the HEARTBEAT_INTERVAL size, but it only prints once per receive, in line with the Receive Message Wait Time.

It seems like something is running the task every interval, deadlocking, and becoming a zombie thread, while the dispatch job itself runs appropriately.
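
For reference, a rough sketch of how that instrumentation can be wired in without editing the gem, via Module#prepend. Shoryuken::Manager#dispatch is an internal method, so this leans on the 3.0.x internals and is illustrative only:

require 'shoryuken'

module DispatchThreadCounter
  def dispatch(*args)
    # print the current thread count before every dispatch cycle
    puts "Dispatching threads=#{Thread.list.size}"
    super
  end
end

Shoryuken::Manager.prepend(DispatchThreadCounter)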

@waynerobinson
Contributor Author

Also, removing the Receive Message Wait Time from the queue makes this problem go away altogether.

@waynerobinson
Contributor Author

There seems to be a bug in the timeout code for Concurrent::TimerTask

ruby-concurrency/concurrent-ruby#526

Basically, the timeout_interval seems to be ignored.

Also, Concurrent::TimerTask doesn't seem to do anything on a timeout apart from notifying its observers that the timeout has occurred. So if the task is still running (due to something like Receive Message Wait Time), it just schedules the next run.

Concurrent::TimerTask manages single-threaded execution of its tasks itself, so the @dispatching lock Shoryuken uses never helps prevent re-execution, because dispatch isn't actually complete before the timeout ends up scheduling more executions.

Each time it schedules an execution it leaks a thread.
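
A minimal standalone sketch of that mechanism (not Shoryuken code; the long sleep stands in for a long-poll ReceiveMessage call, and the observed counts depend on the concurrent-ruby version, 1.0.5 here):

require 'concurrent'

# The task takes far longer than the execution interval, mimicking an empty
# queue with a 20-second Receive Message Wait Time. The timer keeps
# scheduling runs while the previous one is still going, which is where the
# extra threads reportedly come from.
task = Concurrent::TimerTask.new(execution_interval: 1, timeout_interval: 5) do
  sleep 20
end
task.execute

# Watch the thread count grow.
10.times do
  puts "threads=#{Thread.list.size}"
  sleep 2
end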

As for a work-around? I'm not sure. Timeouts in Concurrent::TimerTask seem broken right now, and given the age of that PR I'm not sure it is likely to be fixed. And you don't seem to be able to use Concurrent::TimerTask without timeouts and implement your own locking outside of it.

@waynerobinson
Contributor Author

Any thoughts on using http://ruby-concurrency.github.io/concurrent-ruby/Concurrent/Actor.html for the main dispatcher loop?

Using TimerTask seems like a bit of a hack (not to mention the implementation is broken). Shouldn't there just be a supervised main loop requesting new messages?

I know Concurrent::Actor is listed as an edge-feature right now, but it seems to be the best fit.

The only other option I can think of is rolling this ourselves either with the dispatch loop on the main thread or via a couple of asynchronous methods watching each other.

The quick and dirty fix would be to use Concurrent::ScheduledTask directly, scheduling a new event at the end of processing the previous one (whether it succeeded or failed). We don't really need to request messages every x seconds, nor do we need timeout handling in the dispatcher on top of the timeouts already baked into the AWS/HTTP library.
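
A rough sketch of that rescheduling idea (hypothetical method names; not the actual Shoryuken internals):

require 'concurrent'

def dispatch
  # hypothetical: request and dispatch one batch of messages
end

def schedule_dispatch(delay = 0)
  Concurrent::ScheduledTask.execute(delay) do
    begin
      dispatch
    ensure
      # Reschedule only once the previous run has finished, so runs can
      # never pile up the way the TimerTask timeouts allow.
      schedule_dispatch
    end
  end
end

schedule_dispatch # kick off the loop

Because each run posts the next one to the executor rather than calling itself directly, there's no deep recursion and no overlapping runs.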

@phstc
Collaborator

phstc commented Mar 23, 2017

Hi @waynerobinson

Great investigation. I need more time to dig into it properly.

I did an initial implementation with Concurrent::Actor, but TBH, I only like the mailbox part; I have mixed feelings about the messaging across actors. Besides, it's an edge feature, as you pointed out.

A heartbeat is kinda common for this type of solution; I'm not sure I would call it hacky.

def processor_done
  dispatch
end

def dispatch
  # if there's no processor or no queue available, try again in 1 sec and return
  if no_processor && no_queue
    dispatch_later
    return
  end

  # after finishing requesting
  # call dispatch again
  dispatch
end

def dispatch_later
  @pool.post do
    sleep 1
    dispatch
  end

  # or even sleep in the main thread
  # sleep 1
  # dispatch
end

WDYT?

@waynerobinson
Contributor Author

Sorry, didn't mean to imply that a heartbeat is hacky, just the use of a periodically running task to manage it.

Ideally you'd have a thread (supervisor) that's responsible for keeping the dispatcher running and restarting it if it crashes. And the dispatcher can just run in a loop.

But given the main loop of Shoryuken is just a dispatch loop anyway, I don't see any specific problem with putting this on the main thread, unless you want to implement a pool of dispatchers at a later date.

I don't think the above code would work in Ruby, as it doesn't handle deep recursion well (unlike Erlang/Elixir) and you'd get a stack level too deep error.

But something like:

def processor_done
  dispatch_loop
end

def dispatch_loop
  loop do
    if no_processor && no_queue
      sleep 1
    else
      dispatch
    end
  end
end

def dispatch
  # Actual dispatch code
end

@phstc
Collaborator

phstc commented Mar 23, 2017

Sorry, didn't mean to imply that a heartbeat is hacky, just the use of a periodically running task to manage it.

No need for sorry. I got your point 😄

Cool, I will play with it.

@waynerobinson
Contributor Author

No problem.

Also, I didn't quite understand what processor_done was for, but I don't think the above code makes sense to put in there.

Basically, we just need to find somewhere, at the end of booting everything, for the dispatch loop to start. Or run it in its own thread somewhere.

@phstc
Collaborator

phstc commented Mar 23, 2017

@waynerobinson for how long did you have to keep it running to replicate the issue?

@waynerobinson
Contributor Author

It happens very quickly on an empty queue.

Make sure the Receive Wait Time on the queue is set quite high (like 20 seconds) to make it worse. The thread count increases by hundreds per receive cycle.

Add another TimerTask that just prints out Thread.list.size to see the threads in use.

This was specifically with MRI; I'm not sure whether other Rubies have lower thread limits on the default pools, but you should still experience a memory leak, as the pool's queue size would grow.

@waynerobinson
Contributor Author

It basically doesn't occur if there are messages in a queue to process or the Receive Wait Time is set to 0.

@waynerobinson
Contributor Author

Changed run.rb to the following:

Shoryuken.sqs_client_receive_message_opts = {
  wait_time_seconds: 20
}

class EchoWorker
  include Shoryuken::Worker
  shoryuken_options(
    queue:                    "cashdeck-dev-wayne-leaker",
    auto_delete:              true,
    body_parser:              :json
  )

  def perform(sqs_message, message)
    puts "HI #{Time.now}"
  end
end

task = Concurrent::TimerTask.new(execution_interval: 1) do
  Shoryuken.logger.debug "Threads: #{Thread.list.size}"
end
task.execute

And the output of a few cycles: https://gist.github.com/waynerobinson/20362b034694664bc87395032393a8fd

@phstc
Collaborator

phstc commented Mar 23, 2017

@waynerobinson awesome, I could reproduce that. I'm on it.

@waynerobinson
Contributor Author

Cheers. 👍

phstc pushed a commit that referenced this issue Mar 24, 2017
phstc added a commit that referenced this issue Mar 24, 2017
Stop queuing up heartbeat threads

Fix #338
@phstc
Collaborator

phstc commented Mar 24, 2017
