Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provision error logs #310

Merged
merged 7 commits into from
Sep 12, 2024
Merged

Provision error logs #310

merged 7 commits into from
Sep 12, 2024

Conversation

val500
Copy link
Contributor

@val500 val500 commented Jul 23, 2024

Description

This adds logging of provisioning errors in the device connector to be read by the Testflinger agent and propagated to the event log. The device connector logs errors for each phase in a file, device-connector-error.json, which is read by the agent. This JSON takes the following format:

{
    {phase}_exception_info: {
        exception_name: String
        exception_message: String,
        exception_cause: String
    } 
    ...
}

Resolved issues

Resolves https://warthogs.atlassian.net/browse/CERTTF-359

Documentation

Tests

Adds new tests to the device connector which mocks the MAAS CLI to throw an exception and adds tests to the agent which mocks the running of the provisioning command.

@val500 val500 marked this pull request as draft July 23, 2024 04:29
Copy link
Collaborator

@plars plars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's just a draft and it'll be easier to review when the previous stuff lands, but I wanted to go ahead and start taking a look. Just a few comments for now but I'll take a deeper look later as it progresses.

agent/testflinger_agent/client.py Outdated Show resolved Hide resolved
@val500 val500 force-pushed the provision_error_logs branch 2 times, most recently from c90beb3 to 6d31940 Compare August 5, 2024 22:12
@val500 val500 marked this pull request as ready for review August 6, 2024 14:15
@val500 val500 force-pushed the provision_error_logs branch from 6d31940 to d91c98f Compare August 6, 2024 19:55
Copy link
Collaborator

@plars plars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to fill in the PR details, especially doc/test sections

@val500 val500 requested a review from plars August 8, 2024 22:08
Copy link
Collaborator

@plars plars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the process of trying to do some testing with a real maas device using this change, I encountered something strange. I decided to see if giving it a bad image name would be enough to make it fail properly. The provisioning step definitely failed... but then it kept on going through the job as if nothing had happened, when it should have stopped at that point. I switched back to main and confirmed that main still has the correct behavior.

I haven't had a chance to run down what happened exactly, but I think we need to do some more testing and figure that out

@plars
Copy link
Collaborator

plars commented Aug 20, 2024

@val500 Ok, I looked into this a bit more and I see the problem. In init.py of the maas2 device connector (and possibly others, you will want to look at those also!), we seem to specifically intercept the ProvisioningError so that we can add a nice log message about it, but then instead of re-raising it, we log an error. IIRC this was likely intended to avoid getting a lot of extra debug output for common issues, but in this case we need to expose that back up to the exception handler that will write the output so that it gets propagated to the agent.

We don't want to show the full traceback in the user logs if we can avoid it - an nicer error message there would suffice. But we still need to write that output file with the full details.

@val500 val500 force-pushed the provision_error_logs branch from 4fd3707 to 63f4a83 Compare September 3, 2024 21:50
@val500 val500 force-pushed the provision_error_logs branch from 63f4a83 to 7dbf99a Compare September 4, 2024 14:29
@val500 val500 requested a review from a team as a code owner September 4, 2024 14:29
@val500 val500 requested a review from plars September 4, 2024 14:34
agent/testflinger_agent/agent.py Show resolved Hide resolved
agent/testflinger_agent/agent.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@plars plars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, this seems to be working for me now. I've tried it with the maas connector and also with some others like muxpi since those had to be altered. I also tried forcing a recovery error to ensure that this still works properly so I think it's doing the right things now. The one thing I did notice is that if the error output is REALLY long, then the UI shows up in kind of a strange way on the test-observer side. We might be able to iterate on distilling it down to the more usable message or alternatively, making it display nicer when this happens, but I think that will be for another time. It'll be more clear once we see what happens in real failure situations.
+1 thanks!

@plars plars merged commit f92a3b1 into main Sep 12, 2024
4 checks passed
@plars plars deleted the provision_error_logs branch September 12, 2024 02:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants