
Scanning::Job always cleanup all signals in cleanup #99

Merged

Conversation

@enoodle enoodle force-pushed the scanning_job_cleanup_unqueue_signals branch from 076c085 to 6bf8ad2 Compare August 23, 2017 07:01
@enoodle (Author) commented Aug 23, 2017

@moolitayer @simon3z PTAL

@simon3z simon3z requested a review from cben August 23, 2017 08:29
@simon3z (Contributor) commented Aug 23, 2017

@enoodle in other providers I haven't found anything similar to the unqueue we're doing. Are they doing something else, or did I miss it?

cc @kbrock @roliveri

@enoodle (Author) commented Aug 23, 2017

@simon3z I haven't looked at other providers. I think it is essential for us because we queue messages with a delay (:deliver_on).
We were already unqueuing messages in the cancel signal.
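
For context, a minimal sketch (assuming MiqQueue.put and the usual Job fields, not the exact provider code) of the delayed-signal pattern being discussed; this is why stale messages can stay queued after the job is done:

    # Illustrative only: the job re-queues its own signal with a delay, e.g. to
    # poll the image-inspector pod; exact option names in the real queue_signal
    # may differ.
    def queue_signal(*args, deliver_on: nil)
      MiqQueue.put(
        :class_name  => self.class.name,
        :instance_id => id,
        :method_name => "signal",
        :args        => args,
        :deliver_on  => deliver_on  # e.g. 10.seconds.from_now while polling the pod
      )
    end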

@moolitayer moolitayer changed the title from "Scannig::Job always cleanup all signals in cleanup" to "Scanning::Job always cleanup all signals in cleanup" Aug 23, 2017
@@ -221,6 +221,7 @@ def delete_pod
   end
 
   def cleanup(*args)
+    unqueue_all_signals

@moolitayer commented on this diff: cancel now calls this twice?

@enoodle (Author) replied: @moolitayer Yes, but if we remove this from cancel there will be a race between the next pod_wait and the cleanup signals.

@moolitayer replied: Interesting, makes sense. There should not be a race, since in cancel you force the state to cancel and a state change from cancel to pod_wait is not allowed. But now I see the bug you are fixing is the logging of an invalid state change, so 👍

@enoodle (Author) replied: Exactly, in this fix we are trying to avoid this mechanism that enforces the state machine.

@moolitayer moolitayer left a comment

LGTM

@enoodle (Author) commented Aug 23, 2017

@miq-bot add_label bug

@miq-bot miq-bot added the bug label Aug 23, 2017
@simon3z (Contributor) commented Aug 23, 2017

I am a little reluctant to add this additional unqueue_all_signals. The best case would have been to have it in one place. Let's wait and see whether @cben can think of something, or whether this is the best we can do.

@cben (Contributor) commented Sep 13, 2017

A second unqueue seems harmless to me. It's idempotent and specific to the Job id (e.g. it can't collide with another Job for the same image).
MiqQueue performance impact: minimal; the extra query happens only on errors, and image scan jobs take many minutes (unless one literally sets a scan timeout of 1.second).

[The whole unqueuing business would have to go one day with the re-architecture, but that's way out of scope...]

Is the problem that cancel already unqueues but there are many ways to reach cleanup (aka abort_job) that never unqueue? The BZ is specifically about https://github.com/enoodle/manageiq-providers-kubernetes/blob/6bf8ad2a024/app/models/manageiq/providers/kubernetes/container_manager/scanning/job.rb#L263
What if we added a helper abort method that (1) unqueues and (2) queues abort_job, and changed everything to use that? Would that solve the problem?

Or is that too early, and is it important to unqueue when cleanup actually gets to run?
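
A rough sketch of the helper being proposed here; the method name and arguments are hypothetical, not code from this PR:

    # Hypothetical helper: every abort path first unqueues pending signals for
    # this job, then queues the abort, so no stale signal is left behind.
    def abort_with_cleanup(message, status = "error")
      unqueue_all_signals
      queue_signal(:abort_job, message, status)
    end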

@simon3z (Contributor) commented Sep 19, 2017

> What if we added a helper abort method that (1) unqueues and (2) queues abort_job, and changed everything to use that? Would that solve the problem?
>
> Or is that too early, and is it important to unqueue when cleanup actually gets to run?

@enoodle those questions are for you.
(I am all for unifying the unqueue in a single place if possible)

@enoodle enoodle force-pushed the scanning_job_cleanup_unqueue_signals branch from 6bf8ad2 to b872f9e Compare September 24, 2017 10:46
@enoodle (Author) commented Sep 24, 2017

I left the unqueuing of signals only in the cleanup function and changed the cancel function to call cleanup directly (instead of going through a signal, which now seems unnecessary to me). @cben PTAL

@cben cben (Contributor) left a comment

Neat!

@enoodle (Author) commented Sep 24, 2017

@moolitayer @simon3z Can you review again / merge?

@@ -251,8 +252,7 @@ def cancel(*_args)
     if self.state != "canceling" # ensure change of states
       signal :cancel
     else
-      unqueue_all_signals
-      queue_signal(:cancel_job)
+      cleanup

@moolitayer commented on this diff: use signal like two lines above, so we still pass through the state machine after this change.

@enoodle (Author) replied: @moolitayer I want to avoid going through the state machine, to prevent the race condition that causes the "pod_wait state not permitted on state..." error from being logged. See @cben's comment for why we will move to the finish state anyway.

@enoodle (Author) replied: @moolitayer changed, PTAL

@kbrock (Member) commented Sep 25, 2017

In the near future, we will no longer support canceling messages in the queue.
Is there a way to handle this message gracefully instead of the stacktrace failure?

@enoodle (Author) commented Sep 25, 2017

> In the near future, we will no longer support canceling messages in the queue.
> Is there a way to handle this message gracefully instead of the stacktrace failure?

@moolitayer @kbrock The problem this is aiming to solve is not with cancelling but with timeouts. Because we poll the image-inspector pod by queuing the signal with :deliver_on, if the job is aborted/canceled the pending :pod_wait signal will clash with the state machine. To prevent this we are adding unqueue_all_signals to the cleanup function. It used to also be in the cancel function (which is going to be obsolete soon?), so I moved it from there and called cleanup directly to prevent a race between the two signals.
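
For reference, a minimal sketch of what unqueue_all_signals amounts to, assuming MiqQueue.unqueue; the actual helper in the scanning job may pass slightly different options:

    # Illustrative only: drop any queued signal messages addressed to this job
    # instance, so a delayed :pod_wait can no longer fire after cleanup.
    def unqueue_all_signals
      MiqQueue.unqueue(
        :class_name  => self.class.name,
        :instance_id => id,
        :method_name => "signal"
      )
    end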

@cben (Contributor) commented Sep 25, 2017

@kbrock will queue_signal with deliver_on remain the mechanism for polling, or is there (or will there be) something better to use?

@kbrock (Member) commented Sep 25, 2017

Sorry, I misspoke. A database-backed MiqQueue will be around for a while and will continue to support this functionality. We are moving away from the current queue, but that will be a migration and will take time.

Specifically, we have been removing code that manipulates a message already in the queue. This is a performance problem and a race condition.

What we've been doing is modifying the receiving methods to know how to throw away a message that is no longer valid. Something like return if state == 'cancelled' for this example.

An example is a case where we had outstanding "delete this file" requests after a file had already been deleted. We were manipulating the queue to remove these extra messages. Instead we took a more basic approach:

  def delete_image(filename)
-   FileUtils.rm(filename)
+   FileUtils.rm(filename) if File.exist?(filename)
  end

Not sure if you can think of an equally simple way to gracefully just drop messages that are no longer valid.
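
A hedged sketch of how that guard pattern could apply to this job; the handler shape and state names are assumptions drawn from this discussion, not code in this PR:

    # Illustrative only: a late :pod_wait delivery simply returns when the job
    # has already been canceled or finished, instead of hitting the state machine.
    def pod_wait(*_args)
      return if %w(canceling canceled finished).include?(state)

      # ... normal polling of the image-inspector pod, re-queued with :deliver_on ...
    end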

@enoodle enoodle force-pushed the scanning_job_cleanup_unqueue_signals branch from b872f9e to 7ea2248 Compare September 26, 2017 11:39
@enoodle (Author) commented Sep 27, 2017

@kbrock This change should happen in the state machine, since it receives the signal first and decides whether it is allowed to continue:
https://github.com/ManageIQ/manageiq/blob/master/app/models/job/state_machine.rb#L36
That will make this change somewhat redundant, but until then I think we should unqueue the signals ourselves.
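
For context, an illustrative sketch (not the actual code at the link above) of the kind of check the state machine performs; this is where a stale :pod_wait currently produces the "not permitted" error:

    # Illustrative only: the dispatcher looks the event up in a transition table
    # and rejects it when the current state has no entry for it.
    def signal(event, *args)
      permitted = transitions.fetch(event, {})  # e.g. :pod_wait => {"waiting_to_scan" => "pod_wait"}
      raise "#{event} is not permitted at state #{state}" unless permitted.key?(state)

      self.state = permitted[state]
      send(event, *args)
    end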

@moolitayer

@enoodle can we have a test for this one? I'm asking since this item is fixing a BZ.

@moolitayer

We don't have to actually exercise MiqQueue end to end; a unit test is fine: the test calls signal :cancel (like we do in existing tests) after inserting something (a call from queue_signal) into an empty MiqQueue, and makes sure that in the end the item was removed.
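
A sketch of the kind of spec being described; the `job` helper and exact expectations are assumptions, and the test actually added may look different:

    # Illustrative spec outline: queue a delayed signal, cancel the job, and
    # verify that the pending MiqQueue item for this job is gone.
    it "unqueues pending signals when the job is canceled" do
      job.queue_signal(:pod_wait, :deliver_on => 10.seconds.from_now)
      expect(MiqQueue.where(:instance_id => job.id).count).to eq(1)

      job.cancel
      expect(MiqQueue.where(:instance_id => job.id).count).to eq(0)
    end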

@enoodle enoodle force-pushed the scanning_job_cleanup_unqueue_signals branch from 7ea2248 to b9c49d0 Compare September 27, 2017 14:49
@miq-bot (Member) commented Sep 27, 2017

Checked commit enoodle@b9c49d0 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
2 files checked, 0 offenses detected
Everything looks fine. 🍰

@enoodle (Author) commented Sep 27, 2017

@moolitayer I added a test

@moolitayer moolitayer merged commit 2a8f7ae into ManageIQ:master Sep 27, 2017
@moolitayer moolitayer added this to the Sprint 70 Ending Oct 2, 2017 milestone Sep 27, 2017
@simaishi

Fine backport (to manageiq repo) details:

$ git log -1
commit ef08e9f8fbb7c52fe5757bbe2b606a768ed81963
Author: Mooli Tayer <mtayer@redhat.com>
Date:   Wed Sep 27 18:29:52 2017 +0300

    Merge pull request #99 from enoodle/scanning_job_cleanup_unqueue_signals
    
    Scanning::Job always cleanup all signals in cleanup
    (cherry picked from commit 2a8f7ae706df284c61861a789854d5e5ad62403e)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1496949
