
Scanning::Job always cleanup all signals in cleanup #99

Merged

Conversation

@enoodle enoodle force-pushed the scanning_job_cleanup_unqueue_signals branch from 076c085 to 6bf8ad2 Compare August 23, 2017 07:01
@enoodle (Author) commented Aug 23, 2017

@moolitayer @simon3z PTAL

@simon3z simon3z requested a review from cben August 23, 2017 08:29
@simon3z (Contributor) commented Aug 23, 2017

@enoodle in other providers I haven't found anything similar to the unqueue we're doing. Are they doing something else, or did I miss it?

cc @kbrock @roliveri

@enoodle (Author) commented Aug 23, 2017

@simon3z I haven't looked at other providers. I think it is essential for us because we queue messages with a delay (:deliver_on).
We were already unqueuing messages in the cancel signal.
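
For context, a minimal sketch (assuming MiqQueue.put and the usual Job fields, not the exact provider code) of the delayed-signal pattern being discussed; this is why stale messages can stay queued after the job is done:

    # Illustrative only: the job re-queues its own signal with a delay, e.g. to
    # poll the image-inspector pod; exact option names in the real queue_signal
    # may differ.
    def queue_signal(*args, deliver_on: nil)
      MiqQueue.put(
        :class_name  => self.class.name,
        :instance_id => id,
        :method_name => "signal",
        :args        => args,
        :deliver_on  => deliver_on  # e.g. 10.seconds.from_now while polling the pod
      )
    end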

@moolitayer moolitayer changed the title from "Scannig::Job always cleanup all signals in cleanup" to "Scanning::Job always cleanup all signals in cleanup" Aug 23, 2017
@@ -221,6 +221,7 @@ def delete_pod
   end
 
   def cleanup(*args)
+    unqueue_all_signals

@moolitayer commented on this diff: cancel now calls this twice?

@enoodle (Author) replied: @moolitayer Yes, but if we remove this from cancel there will be a race between the next pod_wait and the cleanup signals.

@moolitayer replied: Interesting, makes sense. There should not be a race, since in cancel you force the state to cancel and a state change from cancel to pod_wait is not allowed. But now I see the bug you are fixing is the logging of an invalid state change, so 👍

@enoodle (Author) replied: Exactly, in this fix we are trying to avoid this mechanism that enforces the state machine.

@moolitayer moolitayer left a comment

LGTM

@enoodle (Author) commented Aug 23, 2017

@miq-bot add_label bug

@miq-bot miq-bot added the bug label Aug 23, 2017
@simon3z (Contributor) commented Aug 23, 2017

I am a little reluctant to add this additional unqueue_all_signals. The best case would have been to have it in one place. Let's wait and see whether @cben can think of something, or whether this is the best we can do.

@cben (Contributor) commented Sep 13, 2017

A second unqueue seems harmless to me. It's idempotent and specific to the Job id (e.g. it can't collide with another Job for the same image).
MiqQueue performance impact: minimal; the extra query happens only on errors, and image scan jobs take many minutes (unless one literally sets a scan timeout of 1.second).

[The whole unqueuing business would have to go one day with the re-architecture, but that's way out of scope...]

Is the problem that cancel already unqueues but there are many ways to reach cleanup (aka abort_job) that never unqueue? The BZ is specifically about https://github.com/enoodle/manageiq-providers-kubernetes/blob/6bf8ad2a024/app/models/manageiq/providers/kubernetes/container_manager/scanning/job.rb#L263
What if we added a helper abort method that (1) unqueues and (2) queues abort_job, and changed everything to use that? Would that solve the problem?

Or is that too early, and is it important to unqueue when cleanup actually gets to run?
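
A rough sketch of the helper being proposed here; the method name and arguments are hypothetical, not code from this PR:

    # Hypothetical helper: every abort path first unqueues pending signals for
    # this job, then queues the abort, so no stale signal is left behind.
    def abort_with_cleanup(message, status = "error")
      unqueue_all_signals
      queue_signal(:abort_job, message, status)
    end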

@simon3z (Contributor) commented Sep 19, 2017

> What if we added a helper abort method that (1) unqueues and (2) queues abort_job, and changed everything to use that? Would that solve the problem?
>
> Or is that too early, and is it important to unqueue when cleanup actually gets to run?

@enoodle those questions are for you.
(I am all for unifying the unqueue in a single place if possible)

@enoodle enoodle force-pushed the scanning_job_cleanup_unqueue_signals branch from 6bf8ad2 to b872f9e Compare September 24, 2017 10:46
@enoodle (Author) commented Sep 24, 2017

I left the unqueuing of signals only in the cleanup function and changed the cancel function to call cleanup directly (instead of going through a signal, which now seems unnecessary to me). @cben PTAL

@cben cben (Contributor) left a comment

Neat!

@enoodle (Author) commented Sep 24, 2017

@moolitayer @simon3z Can you review again / merge?

@@ -251,8 +252,7 @@ def cancel(*_args)
     if self.state != "canceling" # ensure change of states
       signal :cancel
     else
-      unqueue_all_signals
-      queue_signal(:cancel_job)
+      cleanup

@moolitayer commented on this diff: use signal like two lines above, so we still pass through the state machine after this change.

@enoodle (Author) replied: @moolitayer I want to avoid going through the state machine, to prevent the race condition that causes the "pod_wait state not permitted on state..." error from being logged. See @cben's comment for why we will move to the finish state anyway.

@enoodle (Author) replied: @moolitayer changed, PTAL

@kbrock (Member) commented Sep 25, 2017

In the near future, we will no longer support canceling messages in the queue.
Is there a way to handle this message gracefully instead of the stacktrace failure?

@enoodle (Author) commented Sep 25, 2017

> In the near future, we will no longer support canceling messages in the queue.
> Is there a way to handle this message gracefully instead of the stacktrace failure?

@moolitayer @kbrock The problem this is aiming to solve is not with cancelling but with timeouts. Because we poll the image-inspector pod by queuing the signal with :deliver_on, if the job is aborted/canceled the pending :pod_wait signal will clash with the state machine. To prevent this we are adding unqueue_all_signals to the cleanup function. It used to also be in the cancel function (which is going to be obsolete soon?), so I moved it from there and called cleanup directly to prevent a race between the two signals.
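
For reference, a minimal sketch of what unqueue_all_signals amounts to, assuming MiqQueue.unqueue; the actual helper in the scanning job may pass slightly different options:

    # Illustrative only: drop any queued signal messages addressed to this job
    # instance, so a delayed :pod_wait can no longer fire after cleanup.
    def unqueue_all_signals
      MiqQueue.unqueue(
        :class_name  => self.class.name,
        :instance_id => id,
        :method_name => "signal"
      )
    end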

@cben (Contributor) commented Sep 25, 2017

@kbrock will queue_signal with deliver_on remain the mechanism for polling, or is there (or will there be) something better to use?

@kbrock (Member) commented Sep 25, 2017

Sorry, I misspoke. A database-backed MiqQueue will be around for a while and will continue to support this functionality. We are moving away from the current queue, but that will be a migration and will take time.

Specifically, we have been removing code that manipulates a message already in the queue. This is a performance problem and a race condition.

What we've been doing is modifying the receiving methods to know how to throw away a message that is no longer valid. Something like return if state == 'cancelled' for this example.

An example is a case where we had outstanding "delete this file" requests after a file had already been deleted. We were manipulating the queue to remove these extra messages. Instead we took a more basic approach:

  def delete_image(filename)
-   FileUtils.rm(filename)
+   FileUtils.rm(filename) if File.exist?(filename)
  end

Not sure if you can think of an equally simple way to gracefully just drop messages that are no longer valid.
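
A hedged sketch of how that guard pattern could apply to this job; the handler shape and state names are assumptions drawn from this discussion, not code in this PR:

    # Illustrative only: a late :pod_wait delivery simply returns when the job
    # has already been canceled or finished, instead of hitting the state machine.
    def pod_wait(*_args)
      return if %w(canceling canceled finished).include?(state)

      # ... normal polling of the image-inspector pod, re-queued with :deliver_on ...
    end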

@enoodle enoodle force-pushed the scanning_job_cleanup_unqueue_signals branch from b872f9e to 7ea2248 Compare September 26, 2017 11:39
@enoodle (Author) commented Sep 27, 2017

@kbrock This change should happen in the state machine, since it receives the signal first and decides whether it is allowed to continue:
https://github.com/ManageIQ/manageiq/blob/master/app/models/job/state_machine.rb#L36
That will make this change somewhat redundant, but until then I think we should unqueue the signals ourselves.
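
For context, an illustrative sketch (not the actual code at the link above) of the kind of check the state machine performs; this is where a stale :pod_wait currently produces the "not permitted" error:

    # Illustrative only: the dispatcher looks the event up in a transition table
    # and rejects it when the current state has no entry for it.
    def signal(event, *args)
      permitted = transitions.fetch(event, {})  # e.g. :pod_wait => {"waiting_to_scan" => "pod_wait"}
      raise "#{event} is not permitted at state #{state}" unless permitted.key?(state)

      self.state = permitted[state]
      send(event, *args)
    end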

@moolitayer

@enoodle can we have a test for this one? I'm asking since this item is fixing a BZ.

@moolitayer

We don't have to actually exercise MiqQueue end to end; a unit test is fine: the test calls signal :cancel (like we do in existing tests) after inserting something (a call from queue_signal) into an empty MiqQueue, and makes sure that in the end the item was removed.
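
A sketch of the kind of spec being described; the `job` helper and exact expectations are assumptions, and the test actually added may look different:

    # Illustrative spec outline: queue a delayed signal, cancel the job, and
    # verify that the pending MiqQueue item for this job is gone.
    it "unqueues pending signals when the job is canceled" do
      job.queue_signal(:pod_wait, :deliver_on => 10.seconds.from_now)
      expect(MiqQueue.where(:instance_id => job.id).count).to eq(1)

      job.cancel
      expect(MiqQueue.where(:instance_id => job.id).count).to eq(0)
    end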

@enoodle enoodle force-pushed the scanning_job_cleanup_unqueue_signals branch from 7ea2248 to b9c49d0 Compare September 27, 2017 14:49
@miq-bot (Member) commented Sep 27, 2017

Checked commit enoodle@b9c49d0 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
2 files checked, 0 offenses detected
Everything looks fine. 🍰

@enoodle (Author) commented Sep 27, 2017

@moolitayer I added a test

@moolitayer moolitayer merged commit 2a8f7ae into ManageIQ:master Sep 27, 2017
@moolitayer moolitayer added this to the Sprint 70 Ending Oct 2, 2017 milestone Sep 27, 2017
@simaishi

Fine backport (to manageiq repo) details:

$ git log -1
commit ef08e9f8fbb7c52fe5757bbe2b606a768ed81963
Author: Mooli Tayer <mtayer@redhat.com>
Date:   Wed Sep 27 18:29:52 2017 +0300

    Merge pull request #99 from enoodle/scanning_job_cleanup_unqueue_signals
    
    Scanning::Job always cleanup all signals in cleanup
    (cherry picked from commit 2a8f7ae706df284c61861a789854d5e5ad62403e)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1496949
