Thread exhaustion #338

Closed

waynerobinson opened this issue Mar 16, 2017 · 27 comments

Comments

@waynerobinson
Contributor

I'm not sure how to describe this in a reproducible way. :(

But we have a Rails and a non-Rails app that we're using v3 of Shoryuken with.

After processing a number of requests (on a Mac) we are starting to see Error fetching message: can't create Thread: Resource temporarily unavailable errors.

Macs only allow a pretty small number of threads per process (I believe it's 2048) and it is usually a pretty hard limit.

However, we only have concurrency set to 10 and no pool size configured for the DB.

I'm not sure how we're exhausting threads. :(

@waynerobinson
Contributor Author

waynerobinson commented Mar 16, 2017

And I get a feeling it might be time-based rather than job-based.

(as in the leaks happen over time, rather than per message, as I can process thousands of messages after starting without issue)

@waynerobinson
Contributor Author

Reviewing further, this might be my issue.

It's only the non-Rails app, and it only seems to be affected when I'm running it with foreman (which is the strangest part of all this).

@waynerobinson
Contributor Author

I couldn't find anything specific in the foreman changelogs, but updating to the latest version (it was only one behind) has resolved this issue.

@phstc
Collaborator

phstc commented Mar 16, 2017

Thanks @waynerobinson for giving all the details 🍻

I'm happy it's working fine with the latest foreman. Would you mind sending me the foreman version you were using, and the new one? Just in case someone else faces the same issue.

@waynerobinson
Contributor Author

Yeah, I think I've boiled the problem down to something in one of my apps. It doesn't seem like either Foreman's or Shoryuken's fault. My early guess is a connection pooling problem with our DB. It was probably an unwise decision to use the mysql2 gem directly. 😅

@phstc
Collaborator

phstc commented Mar 16, 2017

@waynerobinson hm. Not sure if it's the case, but you need to set your database pool to the same size as your concurrency.
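
For example, a minimal sketch of keeping the two settings in sync (this assumes Active Record and a config/database.yml, which the original non-Rails app may not use; the numbers are placeholders):

# shoryuken.yml
concurrency: 10

# config/database.yml
production:
  adapter: mysql2
  pool: 10 # match Shoryuken's concurrency so every processor thread can check out a connection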

@waynerobinson
Contributor Author

Yup. It is. And it works for a while. Then something goes out of control and starts spawning a bunch of threads. Sigh.

@phstc
Collaborator

phstc commented Mar 16, 2017

It's been a while since I last used MySQL, but please let me know if I can help with debugging.

@waynerobinson
Contributor Author

Thanks. :-)

@waynerobinson
Contributor Author

Turns out this issue was caused by an uncaught exception in a library making HTTP calls via Faraday.

I'm unsure as to why this uncaught exception ended up causing this runaway thread problem though. Shouldn't a thread that ends up throwing an exception just get killed off and move on?

I'll see if I can replicate the issue with a simpler setup.

@phstc
Collaborator

phstc commented Mar 17, 2017

Shouldn't a thread that ends up throwing an exception just get killed off and move on?

It should:

Should a thread crash for any reason the thread will immediately be removed from the pool and replaced.
https://ruby-concurrency.github.io/concurrent-ruby/Concurrent/FixedThreadPool.html
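
A minimal standalone sketch of that behaviour (not Shoryuken code): a task that raises doesn't take the pool down, and later work is still processed.

require 'concurrent'

pool = Concurrent::FixedThreadPool.new(2)

pool.post { raise 'boom' }            # the failing task is dropped; the pool keeps going
pool.post { puts 'still processing' } # subsequent tasks still run

pool.shutdown
pool.wait_for_termination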

I'll see if I can replicate the issue with a simpler setup.

that will help 🍻

@waynerobinson reopened this Mar 21, 2017
@waynerobinson
Contributor Author

waynerobinson commented Mar 21, 2017

OK, turns out all my attempted problem-solving hasn't managed to solve this issue.

However, I think I have managed to replicate it.

Basically, it seems that whenever Shoryuken is running against an empty queue, it leaks threads; while it is processing messages, the leaking sometimes stops.

My suspicion is that it has something to do with the heartbeat timer task not getting cleaned up properly when the queue is actually empty, causing a deadlock somewhere.

When I set the HEARTBEAT_INTERVAL to a larger number, it leaks less.

I can replicate this on Ruby 2.3.1 with the following run.rb file:

class EchoWorker
  include Shoryuken::Worker
  shoryuken_options(
    queue:                    "cashdeck-dev-wayne-leaker",
    auto_delete:              true,
    body_parser:              :json
  )

  def perform(sqs_message, message)
    puts "HI #{Time.now}"
  end
end

With the following shoryuken.yml file:

verbose: true
concurrency: 10  # The number of allocated threads to process messages. Default 25
delay: 0        # The delay in seconds to pause a queue when it's empty. Default 0
queues:
  - [cashdeck-dev-wayne-leaker, 1]

I execute via:
bundle exec shoryuken start -r ./run.rb -L shoryuken.log -C shoryuken.yml

Obviously, you'll need to change the queue. This one has a 2 second wait timeout set.

Gemfile.lock

GEM
  remote: https://rubygems.org/
  specs:
    aws-sdk-core (2.8.8)
      aws-sigv4 (~> 1.0)
      jmespath (~> 1.0)
    aws-sigv4 (1.0.0)
    concurrent-ruby (1.0.5)
    jmespath (1.3.1)
    shoryuken (3.0.3)
      aws-sdk-core (~> 2)
      concurrent-ruby
      thor
    thor (0.19.4)

PLATFORMS
  ruby

DEPENDENCIES
  shoryuken

BUNDLED WITH
   1.14.3

@waynerobinson
Contributor Author

Interesting experiment during testing.

I increased the HEARTBEAT_INTERVAL and when it is higher (close to the Receive Message Wait Time, give or take some latency/processing times), the leaking goes away.

Also, the leaking seems to happen outside of the dispatch cycle itself.

I added puts "Dispatching threads=#{Thread.list.size}" to the top of the Shoryuken::Manager#dispatch method, and this number grows at a rate inversely proportional to the HEARTBEAT_INTERVAL size, but it only prints once per receive, in line with the Receive Message Wait Time.

It seems like something is running the task every interval, deadlocking, and becoming a zombie thread, while the dispatch job itself runs appropriately.
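
For reference, a rough sketch of how that instrumentation can be wired in without editing the gem, via Module#prepend. Shoryuken::Manager#dispatch is an internal method, so this leans on the 3.0.x internals and is illustrative only:

require 'shoryuken'

module DispatchThreadCounter
  def dispatch(*args)
    # print the current thread count before every dispatch cycle
    puts "Dispatching threads=#{Thread.list.size}"
    super
  end
end

Shoryuken::Manager.prepend(DispatchThreadCounter)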

@waynerobinson
Contributor Author

Also, removing the Receive Message Wait Time from the queue makes this problem go away altogether.

@waynerobinson
Contributor Author

There seems to be a bug in the timeout code for Concurrent::TimerTask

ruby-concurrency/concurrent-ruby#526

Basically, the timeout_interval seems to be ignored.

Also, Concurrent::TimerTask doesn't seem to do anything on a timeout apart from notifying its observers that the timeout has occurred. So if the task is still running (due to something like Receive Message Wait Time), it just schedules the next run.

Concurrent::TimerTask manages single-threaded execution of its tasks itself, so the @dispatching lock Shoryuken uses never helps prevent re-execution, because dispatch isn't actually complete before the timeout ends up scheduling more executions.

Each time it schedules an execution it leaks a thread.
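
A minimal standalone sketch of that mechanism (not Shoryuken code; the long sleep stands in for a long-poll ReceiveMessage call, and the observed counts depend on the concurrent-ruby version, 1.0.5 here):

require 'concurrent'

# The task takes far longer than the execution interval, mimicking an empty
# queue with a 20-second Receive Message Wait Time. The timer keeps
# scheduling runs while the previous one is still going, which is where the
# extra threads reportedly come from.
task = Concurrent::TimerTask.new(execution_interval: 1, timeout_interval: 5) do
  sleep 20
end
task.execute

# Watch the thread count grow.
10.times do
  puts "threads=#{Thread.list.size}"
  sleep 2
end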

As for a work-around? I'm not sure. Timeouts in Concurrent::TimerTask seem broken right now, and given the age of that PR I'm not sure it is likely to be fixed. And you don't seem to be able to use Concurrent::TimerTask without timeouts and implement your own locking outside of it.

@waynerobinson
Contributor Author

Any thoughts on using http://ruby-concurrency.github.io/concurrent-ruby/Concurrent/Actor.html for the main dispatcher loop?

Using TimerTask seems like a bit of a hack (not to mention the implementation is broken). Shouldn't there just be a supervised main loop requesting new messages?

I know Concurrent::Actor is listed as an edge-feature right now, but it seems to be the best fit.

The only other option I can think of is rolling this ourselves either with the dispatch loop on the main thread or via a couple of asynchronous methods watching each other.

The quick and dirty fix would be to use Concurrent::ScheduledTask directly, scheduling a new event at the end of processing the previous one (whether it succeeded or failed). We don't really need to request messages every x seconds, nor do we need timeout handling in the dispatcher on top of the timeouts already baked into the AWS/HTTP library.
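
A rough sketch of that rescheduling idea (hypothetical method names; not the actual Shoryuken internals):

require 'concurrent'

def dispatch
  # hypothetical: request and dispatch one batch of messages
end

def schedule_dispatch(delay = 0)
  Concurrent::ScheduledTask.execute(delay) do
    begin
      dispatch
    ensure
      # Reschedule only once the previous run has finished, so runs can
      # never pile up the way the TimerTask timeouts allow.
      schedule_dispatch
    end
  end
end

schedule_dispatch # kick off the loop

Because each run posts the next one to the executor rather than calling itself directly, there's no deep recursion and no overlapping runs.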

@phstc
Collaborator

phstc commented Mar 23, 2017

Hi @waynerobinson

Great investigation. I need more time to dig into it properly.

I did an initial implementation with Concurrent::Actor, but TBH, I only like the mailbox part; I have mixed feelings about the messaging across actors. Besides, it's an edge feature, as you pointed out.

A heartbeat is kinda common for this type of solution; I'm not sure I would call it hacky.

def processor_done
  dispatch
end

def dispatch
  # if there's no processor or no queue available, try again in 1 sec and return
  if no_processor && no_queue
    dispatch_later
    return
  end

  # after finishing requesting
  # call dispatch again
  dispatch
end

def dispatch_later
  @pool.post do
    sleep 1
    dispatch
  end

  # or even sleep in the main thread
  # sleep 1
  # dispatch
end

WDYT?

@waynerobinson
Contributor Author

Sorry, didn't mean to imply that a heartbeat is hacky, just the use of a periodically running task to manage it.

Ideally you'd have a thread (supervisor) that's responsible for keeping the dispatcher running and restarting it if it crashes. And the dispatcher can just run in a loop.

But given the main loop of Shoryuken is just a dispatch loop anyway, I don't see any specific problem with putting this on the main thread, unless you want to implement a pool of dispatchers at a later date.

I don't think the above code would work in Ruby, as it doesn't handle deep recursion well (unlike Erlang/Elixir) and you'd get a stack level too deep error.

But something like:

def processor_done
  dispatch_loop
end

def dispatch_loop
  loop do
    if no_processor && no_queue
      sleep 1
    else
      dispatch
    end
  end
end

def dispatch
  # Actual dispatch code
end

@phstc
Collaborator

phstc commented Mar 23, 2017

Sorry, didn't mean to imply that a heartbeat is hacky, just the use of a periodically running task to manage it.

No need for sorry. I got your point 😄

Cool, I will play with it.

@waynerobinson
Contributor Author

No problem.

Also, I didn't quite understand what processor_done was for, but I don't think the above code makes sense to put in there.

Basically, we just need to find somewhere, at the end of booting everything, for the dispatch loop to start. Or run it in its own thread somewhere.

@phstc
Collaborator

phstc commented Mar 23, 2017

@waynerobinson for how long did you have to keep it running to replicate the issue?

@waynerobinson
Contributor Author

It happens very quickly on an empty queue.

Make sure the Receive Wait Time on the queue is set quite high (like 20 seconds) to make it worse. The thread count increases by hundreds per receive cycle.

Add another TimerTask that just prints out Thread.list.size to see the threads in use.

This was specifically with MRI; I'm not sure whether other Rubies have lower thread limits on the default pools, but you should still experience a memory leak, as the pool's queue size would grow.

@waynerobinson
Contributor Author

It basically doesn't occur if there are messages in a queue to process or the Receive Wait Time is set to 0.

@waynerobinson
Contributor Author

Changed run.rb to the following:

Shoryuken.sqs_client_receive_message_opts = {
  wait_time_seconds: 20
}

class EchoWorker
  include Shoryuken::Worker
  shoryuken_options(
    queue:                    "cashdeck-dev-wayne-leaker",
    auto_delete:              true,
    body_parser:              :json
  )

  def perform(sqs_message, message)
    puts "HI #{Time.now}"
  end
end

task = Concurrent::TimerTask.new(execution_interval: 1) do
  Shoryuken.logger.debug "Threads: #{Thread.list.size}"
end
task.execute

And the output of a few cycles: https://gist.github.com/waynerobinson/20362b034694664bc87395032393a8fd

@phstc
Collaborator

phstc commented Mar 23, 2017

@waynerobinson awesome, I could reproduce that. I'm on it.

@waynerobinson
Contributor Author

Cheers. 👍

phstc pushed a commit that referenced this issue Mar 24, 2017
phstc added a commit that referenced this issue Mar 24, 2017
Stop queuing up heartbeat threads

Fix #338
@phstc
Collaborator

phstc commented Mar 24, 2017
