fix: add guard against NodeStatus. Fixes #11102 #11451
Signed-off-by: isubasinghe <isitha@pipekit.io>
fix: get rid of return for Set, remove logs
Signed-off-by: isubasinghe <isitha@pipekit.io>
fix: has returns correct value
Signed-off-by: isubasinghe <isitha@pipekit.io>
fix: remove debug logs
Signed-off-by: isubasinghe <isitha@pipekit.io>
fix: ensure tests pass
Signed-off-by: isubasinghe <isitha@pipekit.io>
fix: restore back to /bin/bash
Signed-off-by: isubasinghe <isitha@pipekit.io>
fix: remove logging
Signed-off-by: isubasinghe <isitha@pipekit.io>
fix: remove logrus and replace with log
Signed-off-by: isubasinghe <isitha@pipekit.io>
fix: replace panic with errors
Signed-off-by: isubasinghe <isitha@pipekit.io>
fix: add comments and use Get for helper fn
Signed-off-by: isubasinghe <isitha@pipekit.io>
fix: always diagnose as failed when shutdown
Signed-off-by: Isitha Subasinghe <isitha@pipekit.io>
fix: remove error return in taskresult reconcilation
Signed-off-by: Isitha Subasinghe <isitha@pipekit.io>
Simple changes that make error handling mandatory. Without these fixes we encountered too many randomly failing container sets. Invalid accesses are now bubbled up as errors instead of silently relying on default-initialised structs.
Blocking #11493
Looks good to me.
LGTM. Simple change, but so many places needed changing; nice work getting through them!
For my own edification, did the edge case bugs primarily happen on retried workflows? Reading through this code, it seems that retried workflows are the only place where Nodes get deleted?
Please correct me if I'm wrong, just trying to understand when/how this class of bugs would happen
@agilgur5 Some areas of the code weren't checking whether the NodeStatus actually existed. Thanks to Go's (in)sane way of dealing with missing map entries (zero-initialised), we were getting valid structs:
    x := make(map[string]NodeStatus)
    assert x["blah"] == DefaultInitialisedNodeStatus() // function doesn't actually exist, just making a point
This didn't just happen for retried workflows; for some reason, workflows with container sets pretty much randomly failed all the time. I believe this was because some code path was falsely relying on the default-initialised values. This change now forces you to explicitly handle the case where a NodeStatus is missing.
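To make the zero-value behaviour described above concrete, here is a minimal sketch. The `NodeStatus` struct is a trimmed-down stand-in with assumed fields, and `lookupUnchecked` and `lookupChecked` are hypothetical helpers written for illustration; none of this is the code from the PR.

```go
package main

import "fmt"

// NodeStatus is a trimmed-down stand-in for the real type (assumed fields).
type NodeStatus struct {
	ID    string
	Phase string
}

// lookupUnchecked indexes the map directly. For a missing key, Go returns
// the zero value of NodeStatus, which is indistinguishable from a real entry
// whose fields happen to be empty.
func lookupUnchecked(nodes map[string]NodeStatus, id string) NodeStatus {
	return nodes[id]
}

// lookupChecked uses the two-value ("comma ok") form of map indexing, so the
// caller can tell a missing key apart from a stored zero value.
func lookupChecked(nodes map[string]NodeStatus, id string) (NodeStatus, bool) {
	node, ok := nodes[id]
	return node, ok
}

func main() {
	nodes := map[string]NodeStatus{"a": {ID: "a", Phase: "Running"}}

	// The missing key silently yields an empty struct.
	missing := lookupUnchecked(nodes, "blah")
	fmt.Println(missing == (NodeStatus{}))

	// The checked form reports that the key is absent.
	if _, ok := lookupChecked(nodes, "blah"); !ok {
		fmt.Println("node blah does not exist")
	}
}
```

Any code path that reads a field such as `Phase` from the unchecked result would see an empty string rather than an error, which matches the class of bug discussed in this thread.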
Yea I understood the purpose of the PR. Another Go gotcha 😕
This is the part I was curious about. I would think the only time an uninitialized value would be accessed is if the key previously existed, i.e. it was deleted. Otherwise, I would think that the Nodes map would only grow (until it is GC'd), and so a non-existent key should never occur.
This is a good question, and unfortunately one I do not have a definitive answer to. I believe that the codebase is somewhat loosely consistent: code paths are allowed to fail as long as they eventually succeed. That is my educated guess anyway.
Gotcha. Was hoping you might know. Guess we still have to watch out for the root cause somewhere (which may give this new error now at least!). If it's not just impacting retries, there might be a thread safety issue somewhere 🤔
I know that feeling all too well 😅 I like to use these types of fixes as great examples that the amount of code is often not the most impactful part of a contribution!
Jotting down some notes here as I attempted to find the root cause:
The merge with argoproj#11451 reverted this, so this commit is just to reinstate that. The tests included in argoproj#11379 failed to catch this, I've raised argoproj#12129 for this, but in the interests of matching the documentation and kubecon next week I'm putting this PR in now. Fixes argoproj#12117 Signed-off-by: Alan Clucas <alan@clucas.org>
Fixes #11102 and also issues in the same class (such as #10285)
Motivation
Adds a guard to all NodeStatus accesses, which forces callers to check whether a key exists in the map before using the value stored under it.
This is needed because the map returns a default-initialised NodeStatus for a missing key. Default-initialised NodeStatus values have triggered edge case bugs for our customers. This fix has existed internally at Pipekit for a while now and resolved our customers' issues with container sets.
Modifications
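The changes are fairly simple: NodeStatus accesses now go through a guard, and a missing entry is reported as an error instead of yielding a default-initialised struct.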
Verification