-
Notifications
You must be signed in to change notification settings - Fork 899
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make exception handling explicit #5202
Make exception handling explicit #5202
Conversation
7cee29f
to
6a845c4
Compare
dfaeb56
to
0888048
Compare
0888048
to
12ffead
Compare
fe33823
to
8e41a0e
Compare
15ac27c
to
2864b51
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for spearheading this effort! I left some comments inline.
Beyond those, I have a few more broad concerns.
Your LOG.warning("Unhandled exception: %s", e)
rule is a bit too rigid to me. There are a bunch of places that had something like:
try:
do_something()
except Exception as e:
LOG.warning("oh no! %s", e)
recover()
that turned into something like:
try:
do_something
except ValueError as e:
LOG.warning("oh no! %s", e)
recover()
except Exception as e:
LOG.warning("Unhandled exception: %s", e)
LOG.warning("oh no! %s", e)
recover()
I'm not sure how this is an improvement. If we're logging the exception with the old message, then why is the broad exception a problem? Sure, best practices says you should only catch exceptions as narrowly as you need to, but there are certainly use cases for catching Exception
, and as long as we're not actively swallowing exceptions, I don't see the problem...especially if the exception handling is sitting at the top of a giant call stack. In those cases it makes sense to be overly broad in our handling.
To take an example from a recent change that I approved:
cloud-init/cloudinit/net/netplan.py
Line 268 in a22d853
except Exception as e: |
I saw the Exception
here, thought about it for a second, and then decided that it was the best course of action. For two reasons:
- We're logging the exception, so any unexpected issues with still get logged.
- I quite literally don't care what happens for the purposes of setting up cloud-init. The netplan API could raise a YourServerIsLiterallyOnFireException. We'll log it, bail, and then do our own v2 rendering.
In this case, what benefit is there to narrowing the exception type and having duplicate exception handlings with one having an additional warning? Is narrowing the exceptions worth the risk that we missed one or that a new one may pop up in the future?
My other major concern is logging new warnings that will cause cloud-init to exit 2. There are a number of places (random example) where we'll get a new warning where we didn't before. In that example, it even seems there was some effort in ensuring the exception can be silenced when we want it to.
Also for future reference, the bigger a PR gets, the more that is going to fall through the cracks during review. For PRs this big, I think it'd be best to move anything not strictly part of the goal to separate PRs. There's a few refactoring-type commits (json decoding, ModuleDetails) that don't really seem to belong here. Now that they're reviewed, I'm not asking for them to be moved...more just a future thing.
cloudinit/cmd/clean.py
Outdated
@@ -149,6 +151,7 @@ def remove_artifacts(remove_logs, remove_seed=False, remove_config=None): | |||
try: | |||
runparts(settings.CLEAN_RUNPARTS_DIR) | |||
except Exception as e: | |||
LOG.warning("Unhandled exception: %s", e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I agree with these unhandled exception rules when it comes to CLI commands. The exception is being printed/handled using the error()
call below this one, and does it in a more graceful way. I think a log message on the CLI like the following will just add more noise:
2024-05-01 18:10:39,200 - clean.py[WARNING]: Unhandled exception: ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 see comment below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No response needed, but leaving this open because I have changes to make here.
@@ -745,8 +745,12 @@ def status_wrapper(name, args): | |||
else: | |||
try: | |||
status = json.loads(util.load_text_file(status_path)) | |||
except Exception: | |||
pass | |||
except OSError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually want to warn if a file isn't found here?
Regardless, I would probably downgrade these all to info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this doesn't happen, then either a previous stage failed to create a status.json, or something removed/modified it/corrupted it, or a cloud-init stage is getting manually run out of order(basically the same thing as this PR).
All of these sound like "unexpected / bad things" to me - wouldn't a warning make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TheRealFalcon waiting for response on this comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but that other PR already contains a warning, right? I guess it seems like fairly large shift from "it doesn't matter if this fails because we'll create it anyway" to "we need to warn you about this even though it has no bearing on the resulting artifact".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, I'll remove it.
@@ -32,7 +32,7 @@ def get_metadata( | |||
connectivity_url_data={"url": url}, | |||
): | |||
# Fetch the metadata | |||
v1 = read_metadata(url, timeout, retries, sec_between, agent) | |||
v1 = _read_metadata(url, timeout, retries, sec_between, agent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I prefer to convert from bytes to strings ASAP and I'm not sure why we needed to rename the function. Neither change is bad but the changes in the file feel superfluous to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to convert from bytes to strings ASAP
generally no objections to this philosophy, however in cases like this co-locating the .decode()
with the json.loads()
could make the exception handling simpler by including the exception handler that's already required for the json.loads()
with the UnicodeDecodeError
handler. That said, we aren't actually handling this exception here, so that's not an argument for keeping it this way. I can drop the changes to this file if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can drop the changes to this file if you want.
@TheRealFalcon ping - that whole commit might be unwanted given the "convert from bytes asap" philosophy. Thoughts on how to proceed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a nit, so it's not a strong opinion. I'm fine keeping it this way.
4a24b71
to
3376535
Compare
Thanks for the review @TheRealFalcon! Great feedback all around.
That's fair. It made auditing the code easier to implement this way, but we don't need to be so strict. I've updated the rules in the description above to say:
I want to keep the rule list short so that it is easy to remember, while being obvious and clear enough that it doesn't require much interpretation when trying to write or review code. Let me know what you think, suggestions are welcome.
I have some concerns about that too. In the past we've allowed some behavior changes that are bug fixes. I would consider properly warning (and exit 2) about a previously unreported error condition to be a bug fix. There is risk of big unknowns that crop up from this change, but I'm happy to be slow and meticulous with this PR if that will help us build confidence about its direction and consequences.
Agreed, applying the rule feels clunky here and doesn't improve the end user's experience. Unfortunately, writing to
These are just symptoms of the codebase's haphazardous structure. Both of these points are undesirable for several reasons, but while 1) would be trivial to fix 2) would probably take some forethought and planning. I have some thoughts on this topic for how we can better structure the core codebase and how this would benefit the project in different ways, but I'll save that discussion for later. For now maybe we add a rule that says something like: Maybe someday if we feel the urge to address 1) and 2) we can change the rule to say Thoughts? |
From a high level, I think everything you said makes sense. Thanks for laying it all out. I just want to make sure I understand. By try:
do_something()
except Exception as e:
LOG.warning("oh no! %s", e)
recover() If so, +1. If not, let me know what I'm missing. If we want to have hard rules around this behavior, I think:
|
Yes, that would be just fine. I think we're on the same page about this.
+1 Agreed. I can do that in a follow-up PR
+1 I think the rules are best followed in most cases, but exceptions should be allowable. Even in this PR I break this rule in I do think that this PR a larger hurdle in risk of behavior changes than future code which follows these rules will have. That is because in future PRs, we won't have to worry so much about existing runtime exceptions which silently behave in a way that is currently expected by the user since any of those that exist in theory should get fleshed out by this PR. |
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close. If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon. (If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.) |
@TheRealFalcon I forgot to mention that I'd also be okay with patching this out of older releases to mitigate the risk of this change. If we do see new bugs shake out of cloud-init in Oracular as a result of this, the resulting bug fixes will still benefit old releases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made another pass here. I know you still have the Eliminate unnecessary LOG.warning("Unhandled exception",)
piece left to do but figured this was looking at anyway.
I forgot to mention that I'd also be okay with patching this out of older releases to mitigate the risk of this change
I think that's overkill for the entire PR. These changes are spread far enough that I can foresee hitting quilt conflicts all the time if we patched it all out. That said, it might be worth separating the risky ones and patching those. In this review, I added a Rule exception or quilt patch.
comment to lines where I think we either need to make an exception to the rule or add a quilt patch for existing releases for that particular line.
LOG.debug( | ||
"User %s is not correctly configured for ssh_import_id", user | ||
) | ||
util.logexc(LOG, "Unhandled configuration for user %s", user) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valid warning, but we should patch it as to not accidentally break deployments that have an invalid import id but otherwise work.
cloudinit/util.py
Outdated
except IndexError: | ||
continue | ||
except Exception as e: | ||
LOG.warning("Unhandled exception: %s", e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rule exception or quilt patch.
cloudinit/util.py
Outdated
except IndexError: | ||
continue | ||
except Exception as e: | ||
LOG.warning("Unhandled exception: %s", e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rule exception or quilt patch.
cloudinit/util.py
Outdated
except ValueError: | ||
ts = "??" | ||
except Exception as e: | ||
LOG.warning("Unhandled exception: %s", e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rule exception or quilt patch.
cloudinit/util.py
Outdated
except Exception: | ||
pass | ||
except Exception as e: | ||
LOG.warning("Unhandled exception: %s", e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quilt patch
@@ -100,24 +100,6 @@ def test_handler_creates_and_runs_bootcmd_script_with_instance_id(self): | |||
my_id + " iid-datasource-none\n", util.load_text_file(out_file) | |||
) | |||
|
|||
def test_handler_runs_bootcmd_script_with_error(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this no longer needed? The log is gone but we should still get the exception.
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close. If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon. (If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.) |
setting to |
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close. If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon. (If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.) |
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close. If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon. (If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.) |
From this helper's docstring: Ubuntu cloud images previously included a 'eth0.cfg' that had hard coded content. That file would interfere with the rendered configuration if it was present. This code is no longer relevant due to the following: 1) Ubuntu uses netplan 2) This appears like an Ubuntu-specific workaround 3) Debian cloud images use netplan
Remove unnecessary arguments.
Since Python 3.6, json.loads() is capable of receiving arguments of type str, bytes, and bytearray. Cloud-init's code historically manually handled type conversion. This boilerplate is no longer necessary. Add docstring and typing to util.load_json().
Cloud-init's codebase makes extensive use of exception handlers which catch Exception and then do something based on the assumed exception type. This is bad practice and makes it possible for exceptions to be unexpectedly ignored. To remedy this, this commit updates cloud-init to make `Exception` always be an unknown exception which therefore should always produce a log of level `WARN` or higher. Require a minimum of one of the following in each Exception handler: - LOG.warning("Unhandled exception: %s", e) - util.logexc() - with a log level of WARN or higher - re-raise the exception without setting exception type when re-raising[1] Make cloud-init's code adhere to the above set of rules with the following changes: - Limit the scope where Exception is handled to only unexpected error paths. - Add new handlers for more specific exception types. - Add warning logs or increase log level to WARN in some cases. - Add typing where exceptions are returned. - Replace various Exception handlers with ProcessExecutionError. - Remove handling that does nothing. - Remove unused code which handles Exception. [1] This would likely reduce the scope of the exception and make it possible to mask an Exception.
In Python 3.3+, IOError is an alias of OSError[1]. Some call sites use one or the other or both. Standardize on the OSError. [1] https://docs.python.org/3/library/exceptions.html#IOError
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close. If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon. (If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.) |
d68eb04
to
1786f9d
Compare
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close. If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon. (If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.) |
Impetus
Following the transition to the new exit code (2) for easier user introspection of cloud-init, there was a request to audit current callsites for error conditions that are logged at too high or low of a logging level. This PR starts that effort, but only scratches the surface. Much more work needs to be done auditing log levels.
Objective
This PR is the result of an audit of exceptions of type
Exception
. This audit had a simple goal: prevent cloud-init from swallowing exceptions. We don't want exceptions such asAttributeError
,IndexError
,KeyError
,TypeError
,ZeroDivisionError
to be ignored when the exception that we expect to handle is a failed filesystem operation - this would mask important errors at runtime.Rules
To complete the objective, a set of rules was defined and implemented. They are described in one of the commit messages:
By following these rules, we can expect that every
Exception
handler will either issue its own warning or raise to another exception which will raise a warning. By doing this, silently ignoring important classes of exceptions will not happen.Any log of
WARN
would suffice for ensuring that these exceptions are logged, however standardizing this message across the codebase simplifies code auditing and makes it clear to developers when they see this message what has happened.Methodology
Inspect the code for sources of expected exceptions and add them above the
Exception
handler, and adding a properWARN
log to the existingException
exception. This allows us to flesh out any unexpected changes easily in integration testing without risking unintentional logic changes.Reviewer notes
I recommend reading the individual commit messages while reviewing. This PR carries some risk of changing cloud-init logic, but it attempts to mitigate the risk by retaining
Exception
(in most cases) while making all Exception handling into a recoverable error. I believe that the risk that this poses will (after thoroughly reviewed) be outweighed by the current risk we have of unintentionally silencing unexpected exceptions.A few things to keep in mind while reviewing:
subp.subp()
throwsOSError
OSError
is thrown by many filesystem and operating system operations. It is used liberally in this PR.LOG.warning("Unhandled...")
. Ideas are welcome for better messaging and maybe a helper function would make sense - maybe something inlog.py
?Merge type