
Remove timeout from TimerTask #926

Merged

Conversation

jacobat
Contributor

@jacobat jacobat commented Oct 27, 2021

The timeout in TimerTask does not stop a task from continuing to run after
the timeout has passed. This can leak threads: TimerTask will try to run the
provided task again before the previous execution has completed, yet it will
not allow the task to run in parallel, so more and more worker threads queue
up waiting to be scheduled to execute the task.

To illustrate, imagine running a TimerTask with an execution interval of 1 and
a timeout interval of 1, with the task itself running for 4 seconds.

Time    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Worker1   s>  t  -  -  <     >  t  .  .  .  .  .  -  -  -  <
Worker2       s  >  t  .  -  -  -  <     >  t  .  .  .  .  .
Worker3             s  >  t  .  .  .  -  -  -  <     >  t  .
Worker4                         s  >  t  .  .  .  .  .  .  .
Worker5                                     s  >  t  .  .  .

At t=1 worker 1 is spawned (s) and scheduled (>) and starts executing the
task. It times out (t) after 1 second, causing worker 2 to be spawned.
Worker 2 then waits 1 second to be scheduled and another second to time out,
causing worker 3 to be spawned at t=4.

Worker 3 is then scheduled to start at t=5 and times out at t=6. At this
point worker 1 has completed its previous task, so the task queued by worker 3
goes to worker 1, scheduled for t=7. At t=8 worker 1 times out, and since
worker 2 is currently executing (-) and worker 3 is currently waiting (.) for
worker 2 to complete, worker 4 is spawned.

This pattern continues to repeat, with a new worker/thread spawned every 4
seconds.

Closes #639.
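The pileup described above can be reproduced with a minimal stdlib sketch (plain threads and a Mutex, not concurrent-ruby's actual TimerTask internals): a timer fires on a fixed interval, the task outlives that interval, and mutual exclusion parks each newly spawned worker behind the previous one. The intervals here are scaled down to fractions of a second.

```ruby
# Illustrative stdlib sketch of the leak: interval 0.05s, task 0.2s,
# no parallel execution allowed, so workers queue up on the mutex.
mutex = Mutex.new
workers = []

5.times do
  workers << Thread.new do
    mutex.synchronize { sleep 0.2 } # the task runs 4x longer than the interval
  end
  sleep 0.05                        # the execution interval keeps firing anyway
end

waiting = workers.count(&:alive?)
puts "workers still queued or running: #{waiting}"
workers.each(&:join)
```

By the time the fifth timer tick fires, only the first worker has finished; the rest are stacked up behind the mutex, mirroring the Worker1..Worker5 diagram above.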

@jacobat jacobat force-pushed the remote-timeout-from-timer-task branch from f0d68fb to 9af0f8c Compare October 27, 2021 14:44
@sl0thentr0py

Is there an update on this? We would like to use TimerTask in Sentry but the thread leaking is a serious issue that needs to be fixed because it would likely affect our downstream users.

@chrisseaton
Member

@sl0thentr0py working on this today.

@chrisseaton
Member

@sl0thentr0py I don't quite get this - the proposed fix just removes the functionality for timeout, because it's not useful. It doesn't stop you leaking threads if you want to schedule every second something that takes an hour to run. You're still going to have tons of threads in that case. So is this really what you wanted?

@sl0thentr0py

@chrisseaton We went with our own simple threading solution for now anyway, so this isn't really blocking for us anymore.

@chrisseaton
Member

Note that I'll follow up this PR with one that restores the API, and so library compatibility, but warns.

@jacobat
Contributor Author

jacobat commented Mar 8, 2022

@sl0thentr0py I don't quite get this - the proposed fix just removes the functionality for timeout, because it's not useful. It doesn't stop you leaking threads if you want to schedule every second something that takes an hour to run. You're still going to have tons of threads in that case. So is this really what you wanted?

It's been a while, so my memory may be deceiving me. But I believe this stops the thread leaking since it's the timeout that causes the thread leakage. If there is no timeout, the task won't be rescheduled (due to a timeout) until after the previous invocation of the task has completed and so you'll only ever have one worker thread.

@mhenrixon

Shame, I just had to reimplement the entire old timer task because the current one doesn't timeout: mhenrixon/sidekiq-unique-jobs#702

I had already flagged the issue with the timeout leaking threads and suggested a fix by linking to my implementation of the timer task (it used to be easy to diff because it was just a few methods).

See the complaint here: mhenrixon/sidekiq-unique-jobs#701

@eregon
Collaborator

eregon commented Apr 10, 2022

As Petr mentioned in #749 (comment), using Thread#raise (or Thread#kill) is unsafe and brings more problems than solutions. For instance, ensure blocks might be arbitrarily interrupted, which can leak a connection or a file descriptor, keep a Mutex locked when it should not be (and then cause deadlocks), corrupt data, leak the response of one request to another (as explained in the blog post), etc.
So while some users of TimerTask using a timeout may not have experienced an issue yet, there will be an issue sooner or later if the task uses any ensure block, or any kind of state which would be invalid if the task is interrupted arbitrarily in the middle, which is very common.
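The hazard eregon describes can be sketched with plain stdlib threads (illustrative only, not what TimerTask itself did): an async interrupt lands while the task is blocked, its ensure begins cleanup, and a second interrupt aborts that cleanup partway through.

```ruby
# Sketch: asynchronous interrupts can abort an ensure block mid-cleanup.
cleanup_done = false

t = Thread.new do
  begin
    sleep                          # the long-running task, blocked
  rescue RuntimeError
    # the async "timeout" interrupt lands here; ensure starts cleanup
  ensure
    sleep 0.2                      # slow cleanup, e.g. returning a connection
    cleanup_done = true            # never reached if interrupted again
  end
end

sleep 0.1                          # let the task block
t.raise(RuntimeError, "timed out") # first async interrupt
sleep 0.05                         # ensure's cleanup is now in progress
t.kill                             # second interrupt aborts the ensure itself
t.join
puts "cleanup finished? #{cleanup_done}"
```

`Thread.handle_interrupt` exists to defer interrupts around critical sections, but code that was not written with async interruption in mind rarely uses it, which is why interrupting arbitrary tasks is risky.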

@mhenrixon

@eregon any suggestions for other ways of doing this? The execution has to timeout one way or another.

The problem with the way the timer task works after the removal of the timeout is that it can lead to resources being hogged, according to user reports (see the link to the issue in my previous comment).

I read the blog post now, had not seen that one before.

@chrisseaton
Member

Can you do co-operative timeout? Ideally your code would check if it should stop frequently. Then it can stop in an orderly way. This is how major VMs like the JVM are engineered - they stop at well-defined points only.
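One way to sketch the co-operative approach (the helper name and structure here are illustrative, not concurrent-ruby API): the task polls a deadline at safe points and stops itself, so nothing is ever killed mid-cleanup.

```ruby
# Hypothetical co-operative timeout: the task checks a monotonic deadline
# between units of work and stops in an orderly way when it is exceeded.
def run_with_deadline(timeout)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  steps = 0
  10.times do
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep 0.05                     # one unit of interruptible work
    steps += 1
  end
  steps                            # the task decided to stop; nothing was killed
end

steps_done = run_with_deadline(0.12)
puts "completed #{steps_done} of 10 steps before the deadline"
```

The trade-off is that the check points must be frequent enough for the timeout to be useful, which is exactly the "stop at well-defined points" discipline mentioned above.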

@mhenrixon

Can you do co-operative timeout? Ideally your code would check if it should stop frequently. Then it can stop in an orderly way. This is how major VMs like the JVM are engineered - they stop at well-defined points only.

Thanks @sl0thentr0py, that's exactly where I was heading. The timeouts could explain some other random issues that people have been experiencing, I suppose.

Never had to come up with such a solution before. Challenge: accepted

Successfully merging this pull request may close these issues.

Thread leaks in TimerTask for long-executing jobs