Scanning::Job always cleanup all signals in cleanup #99
Conversation
Force-pushed from 076c085 to 6bf8ad2
@moolitayer @simon3z PTAL
@simon3z I haven't looked at other providers; I think it is essential for us because we are queuing messages with a delay (…)
```diff
@@ -221,6 +221,7 @@ def delete_pod
   end

   def cleanup(*args)
+    unqueue_all_signals
```
`cancel` now calls this twice?
@moolitayer Yes, but if we remove this from `cancel` there will be a race between the next `pod_wait` and the `cleanup` signals.
Interesting, makes sense. There should not be a race, since in `cancel` you force the state to be `cancel` and a state change from `cancel` to `pod_wait` is not allowed. But now I see the bug you are fixing is the logging of an invalid state change, so 👍
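
As an aside, the "not permitted" enforcement referred to here can be pictured with a minimal sketch. The transition table below is hypothetical (the real one lives in the Scanning::Job source); only the lookup idea matters:

```ruby
# Illustrative sketch only: a signal is checked against a transitions map
# keyed by signal name, and rejected (logged) when the current state is not
# listed for that signal. The entries below are made up for the example.
TRANSITIONS = {
  :pod_wait => {"pod_create" => "pod_wait", "pod_wait" => "pod_wait"},
  :cancel   => {"*"          => "canceling"},
  :cleanup  => {"*"          => "pod_delete"}
}.freeze

def transition_permitted?(signal, current_state)
  allowed = TRANSITIONS.fetch(signal, {})
  allowed.key?(current_state) || allowed.key?("*")
end

transition_permitted?(:cancel,   "pod_wait")  # => true
transition_permitted?(:pod_wait, "canceling") # => false, logged as an invalid state change
```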
Exactly. In this fix we are trying to avoid this mechanism that enforces the state machine.
LGTM
@miq-bot add_label bug
I am a little bit reluctant to add this additional …
Second unqueue seems harmless to me. It's idempotent and specific to the Job id (e.g. it can't collide with another Job for the same image). [The whole unqueuing business would have to go one day with re-arch, but that's way out of scope...] Is the problem that …? Or is it that … too early, and it's important to unqueue when …?
@enoodle those questions are for you.
Force-pushed from 6bf8ad2 to b872f9e
I left the unqueueing of signals only in the …
Neat!
- The job still goes to `:finish` state, due to `cleanup` running `process_cancel`: https://github.com/ManageIQ/manageiq/blob/a3c9213171/app/models/job.rb#L109-L114
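
For readers who don't follow the link, the shape of that path is roughly the sketch below. This is a hypothetical simplification, not the linked manageiq code; the point is that the cancel handling ends by signalling `:finish`:

```ruby
# Hypothetical simplification of a process_cancel-style handler: the cancel
# path still drives the job to the :finish state.
def process_cancel(*args)
  message = args[0] || "Job canceled by user"
  status  = args[1] || "error"
  signal(:finish, message, status) # the job ends up in :finish regardless
end
```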
@moolitayer @simon3z Can you review again / merge?
```diff
@@ -251,8 +252,7 @@ def cancel(*_args)
     if self.state != "canceling" # ensure change of states
       signal :cancel
     else
-      unqueue_all_signals
-      queue_signal(:cancel_job)
+      cleanup
```
Use `signal` like two lines above, so we still pass through the state machine after this change.
@moolitayer I don't want to go through the state machine, to prevent the race condition that will cause the "pod_wait state not permitted on state..." error from being printed. See @cben's comment for why we will move to state `finish` anyway.
There won't be a race since it is synchronous. It should work if you use `signal(:cancel_job)`, due to: https://github.com/enoodle/manageiq-providers-kubernetes/blob/b872f9eb1415345093b720790bca3a326d55975d/app/models/manageiq/providers/kubernetes/container_manager/scanning/job.rb#L30
@moolitayer changed, PTAL
In the near future, we will no longer support canceling messages in the queue.
@moolitayer @kbrock The problem this is aiming to solve is not with cancelling but with timeout, because we are polling the image-inspector pod using …
@kbrock Will `queue_signal` with `deliver_on` remain the mechanism for polling, or is/will there be something better to use?
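
For context, the polling pattern under discussion looks roughly like the sketch below. The method body and the `:role` value are assumptions, not a verified copy of `Scanning::Job#queue_signal`; `MiqQueue.put` does accept a `:deliver_on` option for delayed delivery:

```ruby
# Sketch (assumption): re-queue the signal with :deliver_on so it is
# delivered again after a delay, which is how the pod polling loop works.
def queue_signal(*args, deliver_on: nil)
  MiqQueue.put(
    :class_name  => self.class.name,
    :instance_id => id,
    :method_name => "signal",
    :args        => args,
    :role        => "smartstate", # assumed role for scanning work
    :deliver_on  => deliver_on
  )
end

# e.g. poll the image-inspector pod again in 10 seconds:
# queue_signal(:pod_wait, :deliver_on => 10.seconds.from_now)
```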
Sorry, I misspoke. A database backed … Specifically, we have been removing code that manipulates a message already in the queue. This is a performance problem and a race condition. What we've been doing is modifying the receiving methods to know how to throw away a message that is no longer valid. Something like …

An example is a case where we had outstanding "delete this file" requests after a file had already been deleted. We were manipulating the queue to remove these extra messages. Instead we took a more basic approach:

```diff
 def delete_image(filename)
-  File.rm(filename)
+  File.rm(filename) if File.exist?(filename)
 end
```

Not sure if you can think of an equally simple way to gracefully just drop messages that are no longer valid.
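
One hypothetical way that idea could translate to this job (not something done in this PR): let the receiving handler drop the stale message itself instead of scrubbing the queue.

```ruby
# Hypothetical alternative: the polling handler ignores signals that arrive
# after the job has already been canceled or finished.
def pod_wait(*_args)
  return if %w[canceling canceled finished].include?(state) # stale signal, drop it
  # ...normal polling of the image-inspector pod would continue here...
end
```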
Force-pushed from b872f9e to 7ea2248
@kbrock This change should happen in the state machine, as it gets the signal first and decides if it is allowed to continue: …
@enoodle Can we have a test for this one? I'm asking since this item is fixing a BZ.
We don't have to actually use MiqQueue; a unit test is fine. The test calls signal cancel (like we do in existing tests) after inserting something (a call from `queue_signal`) into an empty MiqQueue; make sure that in the end the item was removed.
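
A sketch of the described test, with hypothetical setup (the spec actually added in this PR may be structured differently):

```ruby
describe ManageIQ::Providers::Kubernetes::ContainerManager::Scanning::Job do
  it "unqueues pending signals when the job is canceled" do
    job = described_class.create!(:state => "pod_wait") # assumption: minimal setup for the example
    job.queue_signal(:pod_wait)                         # insert one item into an empty MiqQueue
    expect(MiqQueue.count).to eq(1)

    job.signal(:cancel)                                 # the cancel path should unqueue it

    expect(MiqQueue.count).to eq(0)
  end
end
```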
Force-pushed from 7ea2248 to b9c49d0
Checked commit enoodle@b9c49d0 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
@moolitayer I added a test.
Fine backport (to manageiq repo) details:
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1481230
For reference, `unqueue_all_signals` is defined in the same file here: https://github.com/enoodle/manageiq-providers-kubernetes/blob/6bf8ad2a024c430207085e481c8e7c87643ed613/app/models/manageiq/providers/kubernetes/container_manager/scanning/job.rb#L308
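
For orientation, such a helper is typically a thin wrapper around deleting the job's pending queue items. The sketch below is an assumption based on plain MiqQueue/ActiveRecord usage, not a verbatim copy of the linked method:

```ruby
# Sketch (assumption): remove all queued "signal" messages addressed to this job.
def unqueue_all_signals
  MiqQueue.where(
    :class_name  => self.class.name,
    :instance_id => id,
    :method_name => "signal"
  ).destroy_all
end
```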