
[Upgrade Details] Ensure details report UPG_WATCHING for the entire time that the upgrade is being watched #3827

Merged
merged 29 commits into elastic:main from upgrade-details-fix-upg-watching on Dec 19, 2023

Conversation

ycombinator
Contributor

@ycombinator ycombinator commented Nov 27, 2023

What does this PR do?

This PR fixes the upgrade details such that they are in the UPG_WATCHING state the entire time the Agent upgrade is being watched by the Upgrade Watcher.

Why is it important?

So that the state of the upgrade is accurately reported.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool (bug was never released)
  • I have added an integration test or an E2E test

How to test this PR locally

  1. Build Elastic Agent from this PR but give it a lower-than-current version number. This will be the starting (pre-upgrade) version of the Agent.
    AGENT_PACKAGE_VERSION=8.11.0 EXTERNAL=true SNAPSHOT=true PLATFORMS=darwin/arm64 PACKAGES=targz mage package
    
  2. Make a no-op commit. Without this, the Agent upgrade will not succeed.
  3. Build Elastic Agent again. This will be the target (post-upgrade) version of the Agent.
    EXTERNAL=true SNAPSHOT=true PLATFORMS=darwin/arm64 PACKAGES=targz mage package
    
  4. Install the starting version of the Agent.
  5. Upgrade to the target version of the Agent.
    sudo elastic-agent upgrade 8.12.0-SNAPSHOT --source-uri file:///Users/shaunak/development/github/elastic-agent/build/distributions --skip-verify
    
  6. Check that the Agent status reports upgrade details with a state of UPG_WATCHING.
    sudo elastic-agent status --output json | jq '.upgrade_details'
    {
      "target_version": "8.12.0",
      "state": "UPG_WATCHING",
      "metadata": {}
    }
    
  7. Check the Agent logs and verify the upgrade details states are in order. In particular, make sure that chronologically, we see UPG_WATCHING after UPG_RESTARTING.
    sudo grep -R -h --include=\*.ndjson UPG_ /Library/Elastic/Agent | jq -c -s 'sort_by(.["@timestamp"]) | .[]'
    {"log.level":"info","@timestamp":"2023-11-28T11:13:45.530Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":499},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.12.0-SNAPSHOT","state":"UPG_REQUESTED","metadata":{}},"ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"2023-11-28T11:13:45.531Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":499},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.12.0-SNAPSHOT","state":"UPG_DOWNLOADING","metadata":{}},"ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"2023-11-28T11:13:45.729Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":499},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.12.0-SNAPSHOT","state":"UPG_EXTRACTING","metadata":{}},"ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"2023-11-28T11:13:52.308Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":499},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.12.0-SNAPSHOT","state":"UPG_REPLACING","metadata":{}},"ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"2023-11-28T11:13:52.313Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":499},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.12.0-SNAPSHOT","state":"UPG_REPLACING","metadata":{}},"ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"2023-11-28T11:13:52.320Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":499},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.12.0-SNAPSHOT","state":"UPG_RESTARTING","metadata":{}},"ecs.version":"1.6.0"}
    {"log.level":"info","@timestamp":"2023-11-28T11:14:00.374Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":499},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.12.0","state":"UPG_WATCHING","metadata":{}},"ecs.version":"1.6.0"}
    
  8. Wait until the Upgrade Watcher has finished running.
    pgrep -f 'elastic-agent watch' | wc -l    # should report 0 eventually
    
  9. Check the Agent version and verify that it's the target version.
    sudo elastic-agent version
    
    Binary: 8.12.0-SNAPSHOT (build: 80bb6a61369c20e054478a73c6c866aadfcc52b1 at 2023-11-27 23:38:37 +0000 UTC)
    Daemon: 8.12.0-SNAPSHOT (build: 80bb6a61369c20e054478a73c6c866aadfcc52b1 at 2023-11-27 23:38:37 +0000 UTC)
    
  10. Cleanup: revert/remove the no-op commit from step 2.

Related issues

@ycombinator ycombinator force-pushed the upgrade-details-fix-upg-watching branch from 0ee25d4 to b1f5e51 Compare November 28, 2023 11:23
@ycombinator ycombinator marked this pull request as ready for review November 28, 2023 16:02
@ycombinator ycombinator requested a review from a team as a code owner November 28, 2023 16:02
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Comment on lines 120 to 122
// - the marker was just created and the upgrade is about to start
// (marker.details.state should not be empty), or
// - the upgrade was rolled back (marker.details.state should be UPG_ROLLBACK)
Member

Is there a third option we haven't considered? The case where the agent is restarted in the middle of an upgrade, for example when it is in the UPG_EXTRACTING state. The obvious example would be the host system powering off.

What happens in this case? What is reported in the upgrade details?

Member

If we just rolled back, is the previous version of the agent guaranteed to be next to the currently running agent in the data path? That is, is the state of the filesystem always something like:

data/
  elastic-agent-current/
  elastic-agent-next/

If this is true, can we look to see if there's another agent in the data directory next to us to detect whether there was a rollback?
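For illustration, a minimal sketch of the check being suggested here (hypothetical helper; the directory-name prefix is an assumption based on the layout above, not the project's actual naming):

package sketch

import (
	"os"
	"strings"
)

// otherAgentDirPresent is a hypothetical helper: it reports whether more than
// one elastic-agent-* directory exists under dataDir, i.e. whether another
// agent version is sitting next to the currently running one.
func otherAgentDirPresent(dataDir string) (bool, error) {
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		return false, err
	}
	agentDirs := 0
	for _, e := range entries {
		if e.IsDir() && strings.HasPrefix(e.Name(), "elastic-agent-") {
			agentDirs++
		}
	}
	return agentDirs > 1, nil
}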

Member

We will likely have both versions next to each other if the upgrade is interrupted immediately after the artifact is extracted, so that may not be 100% reliable either.

One more thing, in the case that the host system powers off we should start calling https://pkg.go.dev/os#File.Sync on the marker file. It doesn't look like we do this today.

// On non-Windows platforms, writeMarkerFile simply writes the marker file.
// See marker_access_windows.go for behavior on Windows platforms.
func writeMarkerFile(markerFile string, markerBytes []byte) error {
	return os.WriteFile(markerFilePath(), markerBytes, 0600)
}

This might only be worth doing around critical transitions, like when the agent process re-execs or the watcher first starts up.
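A minimal sketch of what a synced write might look like (hypothetical function name, not the project's code; os.WriteFile does not fsync, so the file is opened and synced explicitly):

package sketch

import "os"

// writeMarkerFileSynced is a sketch: it writes the marker bytes and flushes
// them to stable storage before returning, so an abrupt power loss is less
// likely to leave a missing or truncated marker file.
func writeMarkerFileSynced(markerFile string, markerBytes []byte) error {
	f, err := os.OpenFile(markerFile, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0600)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(markerBytes); err != nil {
		return err
	}
	// os.File.Sync wraps fsync(2); this is the call suggested above.
	return f.Sync()
}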

Contributor Author

@ycombinator ycombinator Nov 28, 2023

One more thing, in the case that the host system powers off we should start calling https://pkg.go.dev/os#File.Sync on the marker file...

I've implemented the fsync change in its own PR, since it's not strictly related to this PR here: #3836

Contributor Author

Is there a third option we haven't considered? The case where the agent is restarted in the middle of an upgrade, for example when it is in the UPG_EXTRACTING state. The obvious example would be the host system powering off.

What happens in this case? What is reported in the upgrade details?

It depends on whether the state transition happens before or after the upgrade marker file comes into existence.

The following states occur before the upgrade marker file comes into existence and, as such, are never persisted in it: UPG_REQUESTED, UPG_SCHEDULED, UPG_DOWNLOADING, UPG_EXTRACTING. Additionally, the UPG_RESTARTING state is also currently not being persisted to the upgrade marker file, mostly because of where this state transition happens in the code vs. where the upgrade marker file is being created. With some refactoring, we could start persisting this state to the upgrade marker file as well. So if the Agent were to restart during one of these states, the upgrade details that are stored in the Coordinator state (and from there sent to Fleet) would get reset to nothing and the upgrade state would be lost.

The following states occur right before or after the upgrade marker file comes into existence and, as such, do get persisted to it: UPG_REPLACING, UPG_WATCHING, UPG_ROLLBACK. So if the Agent were to restart during one of these states, the upgrade details from the upgrade marker would be restored to the Coordinator state (and from there sent to Fleet).

We may want to consider either persisting upgrade details in their own file throughout the upgrade process OR creating the upgrade marker file at the start of the upgrade process instead of where it's being created now (right before the Upgrade Watcher is invoked from the old Agent).
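As a rough sketch of the first idea (persisting upgrade details in their own file on every state change), with all names and types below being assumptions rather than the project's API:

package sketch

import (
	"encoding/json"
	"os"
)

// upgradeDetails is a stand-in for the real upgrade details structure.
type upgradeDetails struct {
	TargetVersion string            `json:"target_version"`
	State         string            `json:"state"`
	Metadata      map[string]string `json:"metadata"`
}

// persistDetails would be called on every state transition, so states that are
// never written to the upgrade marker today (UPG_REQUESTED, UPG_DOWNLOADING,
// UPG_EXTRACTING, UPG_RESTARTING, ...) could still be recovered after a restart.
func persistDetails(path string, d upgradeDetails) error {
	b, err := json.Marshal(d)
	if err != nil {
		return err
	}
	return os.WriteFile(path, b, 0600)
}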

Contributor Author

If we just rolled back, is the previous version of the agent guaranteed to be next to the currently running agent in the data path? That is, is the state of the filesystem always something like:

data/
  elastic-agent-current/
  elastic-agent-next/

If this is true, can we look to see if there's another agent in the data directory next to us to detect whether there was a rollback?

Two folders will exist the moment we go past the UPG_EXTRACTING state and, yes, two folders will also exist when we are in the UPG_ROLLBACK state. So I'm not sure checking whether the number of folders is greater than 1 is sufficient to determine if we're about to upgrade or if we've just rolled back.

But I think there might be another solution to detecting if we're about to upgrade: the code that creates the Upgrade Marker file runs in the same process as the code that's watching the Upgrade Marker file for changes. As such, the former code can communicate to the latter in memory that we're about to upgrade. In the case of a rollback, this communication will not happen.

Let me explore this solution in a separate PR as it's not strictly related to this PR here.
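A toy sketch of that in-memory signal (all names hypothetical): the upgrade code announces, in process, that a marker is about to be written, so a marker that later appears without an announcement can be treated as a rollback.

package sketch

// markerExpectation is a hypothetical in-memory channel between the code that
// writes the upgrade marker and the code that watches it for changes.
type markerExpectation struct {
	ch chan struct{}
}

func newMarkerExpectation() *markerExpectation {
	return &markerExpectation{ch: make(chan struct{}, 1)}
}

// UpgradeStarting is called by the upgrader just before it writes the marker.
func (m *markerExpectation) UpgradeStarting() {
	select {
	case m.ch <- struct{}{}:
	default: // already announced
	}
}

// MarkerAppeared is called by the watcher when it sees a new marker file. It
// returns true if the upgrader announced it, false if the marker showed up
// unannounced (for example, rewritten during a rollback).
func (m *markerExpectation) MarkerAppeared() bool {
	select {
	case <-m.ch:
		return true
	default:
		return false
	}
}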

Contributor Author

But I think there might be another solution to detecting if we're about to upgrade: the code that creates the Upgrade Marker file runs in the same process as the code that's watching the Upgrade Marker file for changes. As such, the former code can communicate to the latter in memory that we're about to upgrade. In the case of a rollback, this communication will not happen.

Let me explore this solution in a separate PR as it's not strictly related to this PR here.

#3837

Member

The following states occur right before or after the upgrade marker file comes into existence and, as such, do get persisted to it: UPG_REPLACING, UPG_WATCHING, UPG_ROLLBACK. So if the Agent were to restart during one of these states, the upgrade details from the upgrade marker would be restored to the Coordinator state (and from there sent to Fleet).

What happens to the upgrade action if an upgrade is interrupted after it is started but before it is completed? Does the agent start over from the beginning? Does it acknowledge it as if the upgrade had happened even though it didn't? Does it never get acknowledged?

What does the watcher do if it starts up, sees an upgrade marker, but the version of the agent that is currently running isn't the version that should be running?

It would be surprising for a user to see an upgrade stuck in UPG_REPLACING, for example. If the upgrade is essentially aborted by a host reboot, then UPG_ROLLBACK could be considered the correct state. The UPG_WATCHING state always clears itself when the watcher stops, but I'm not sure what happens to this state in this situation today.

I think the ideal thing to happen in this situation is that the upgrade restarts from the beginning when the agent host system comes back online.

Contributor Author

What happens to the upgrade action if an upgrade is interrupted after it is started but before it is completed? Does the agent start over from the beginning? Does it acknowledge it as if the upgrade had happened even though it didn't? Does it never get acknowledged?

Looking at the code, once the Upgrade Marker has been created, if the Agent restarts, it will acknowledge the upgrade with Fleet even if the upgrade may not have completed. In fact, it's entirely possible that Agent acknowledges the upgrade with Fleet while the Upgrade Watcher is still running and then the Upgrade Watcher decides to roll back the Agent. As far as Fleet is concerned, the upgrade would've been reported as successful, and then within 10 minutes, the previous version of Agent would start showing again without any explanation as to why.

With Upgrade Details, in the above rollback scenario, the Upgrade Watcher will record the state as UPG_ROLLBACK in the Upgrade Marker file, which will get picked up by the main Agent process and sent to Fleet.

What does the watcher do if it starts up, sees an upgrade marker, but the version of the agent that is currently running isn't the version that should be running?

Again, looking at the code...

First, if the Upgrade Watcher was running before an interruption stopped/killed it, and then the Upgrade Watcher was restarted, it will exit immediately because the watcher lock file, watcher.lock, will still exist. In this case, there are two possibilities as to which state will be reported in Upgrade Details to Fleet:

  • if the Upgrade Marker contains Upgrade Details, the state recorded in it will be reported to Fleet. This should be the UPG_REPLACING state as it's the last state that's persisted to the Upgrade Marker before the old Agent's upgrade code restarts the new Agent.
  • if, for some reason, the Upgrade Marker does not contain Upgrade Details and the version of the running Agent is the same as the previous version recorded in the Upgrade Marker, UPG_ROLLBACK will be reported to Fleet.

If the Upgrade Watcher wasn't running yet when the upgrade process was interrupted, the Upgrade Watcher will start monitoring the Agent regardless of what version of Agent is currently running or what version is recorded in the Upgrade Marker file. In this case, UPG_WATCHING will be reported to Fleet. If the watch succeeds, the Upgrade Details will stop being reported to Fleet; note that the version of Agent being reported to Fleet will already be the new one in this case. If the watch fails and Agent has to be rolled back, UPG_ROLLBACK will be reported to Fleet.
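Condensing the two scenarios above into a small Go sketch (hypothetical types and names; this only restates the prose, it is not the project's code):

package sketch

// marker is a stand-in for the upgrade marker contents relevant here.
type marker struct {
	DetailsState    string // e.g. "UPG_REPLACING"; empty if no details were persisted
	PreviousVersion string
}

// reportStateAfterRestart sketches which upgrade state would be reported to
// Fleet when the agent restarts and finds an upgrade marker on disk.
func reportStateAfterRestart(m marker, runningVersion string, watcherAlreadyRan bool) string {
	if watcherAlreadyRan {
		// The restarted watcher exits immediately because watcher.lock still exists.
		if m.DetailsState != "" {
			return m.DetailsState // typically UPG_REPLACING
		}
		if runningVersion == m.PreviousVersion {
			return "UPG_ROLLBACK"
		}
		return "" // not covered by the description above
	}
	// The watcher had not started yet: it begins watching regardless of versions.
	return "UPG_WATCHING"
}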

I think the ideal thing to happen in this situation is that the upgrade restarts from the beginning when the agent host system comes back online.

Agreed but I think restarting the upgrade from the beginning after a crash is beyond the scope of this PR so I've created #3860 to track this improvement.

Member

Agree fixing this is outside the scope of this PR given your explanation.

I also think we acknowledge the upgrade too early since it isn't synchronized with the watcher, but that is also out of scope.

Member

@AndersonQ AndersonQ left a comment

I tested it using the air-gapped test; it works. There is still what I believe is something similar to #3821, but that's because it's upgrading from one snapshot to another. Thus, all good for this PR.

	// error context added by checkUpgradeDetailsState
	return err
}

Member

Maybe we want to check for a final state of the upgrade, like COMPLETED or FAILED?
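For example, a check along these lines could gate the final assertion (the exact set of terminal state names below is an assumption based on the states mentioned in this PR):

package sketch

// isTerminalUpgradeState sketches the suggested check: before asserting that
// the upgrade details disappear, confirm the upgrade reached a terminal state.
func isTerminalUpgradeState(state string) bool {
	switch state {
	case "UPG_COMPLETED", "UPG_FAILED", "UPG_ROLLBACK":
		return true
	default:
		return false
	}
}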

Member

@pchila pchila left a comment

Just a small comment about checking the final state of an upgrade before asserting that the upgrade details disappear

Contributor

mergify bot commented Nov 30, 2023

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b upgrade-details-fix-upg-watching upstream/upgrade-details-fix-upg-watching
git merge upstream/main
git push upstream upgrade-details-fix-upg-watching

@ycombinator ycombinator force-pushed the upgrade-details-fix-upg-watching branch 2 times, most recently from bf1767d to 9925381 Compare December 1, 2023 13:57
Contributor

mergify bot commented Dec 1, 2023

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b upgrade-details-fix-upg-watching upstream/upgrade-details-fix-upg-watching
git merge upstream/main
git push upstream upgrade-details-fix-upg-watching

Member

@cmacknz cmacknz left a comment

Tested manually and confirmed it works, thanks!

@ycombinator ycombinator force-pushed the upgrade-details-fix-upg-watching branch from 49883a3 to 7344cd7 Compare December 5, 2023 18:58
@cmacknz cmacknz added backport-v8.12.0 Automated backport with mergify and removed backport-skip labels Dec 6, 2023
@ycombinator ycombinator force-pushed the upgrade-details-fix-upg-watching branch from 5cb41f4 to b040d29 Compare December 6, 2023 16:22
@ycombinator ycombinator force-pushed the upgrade-details-fix-upg-watching branch from 05cb012 to 67cf265 Compare December 14, 2023 17:52
@ycombinator ycombinator enabled auto-merge (squash) December 15, 2023 02:02
@ycombinator ycombinator disabled auto-merge December 15, 2023 02:06
@ycombinator
Contributor Author

ycombinator commented Dec 15, 2023

All upgrade-related integration tests are passing now, but all Endpoint-related integration tests are failing with the same symptom:

Endpoint component or units are not healthy

[EDIT] Downloaded a failing test's diagnostic and looked at the Endpoint logs within it. The earliest errors in the logs say this:

{"@timestamp":"2023-12-15T13:45:00.559583771Z","agent":{"id":"13c4c478-4006-46dc-8cef-28489a279fc2","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":3049,"name":"Artifacts.cpp"}}},"message":"Artifacts.cpp:3049 HTTP code 401: Unauthorized","process":{"pid":26341,"thread":{"id":26377}}}
{"@timestamp":"2023-12-15T13:45:00.559604491Z","agent":{"id":"13c4c478-4006-46dc-8cef-28489a279fc2","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":3057,"name":"Artifacts.cpp"}}},"message":"Artifacts.cpp:3057 Message: {\"statusCode\":401,\"error\":\"ErrNoAuthHeader\",\"message\":\"no authorization header\"}","process":{"pid":26341,"thread":{"id":26377}}}
{"@timestamp":"2023-12-15T13:45:00.559618691Z","agent":{"id":"13c4c478-4006-46dc-8cef-28489a279fc2","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":3088,"name":"Artifacts.cpp"}}},"message":"Artifacts.cpp:3088 Failed to download artifact endpoint-hostisolationexceptionlist-linux-v1 - HTTP non-200 code received","process":{"pid":26341,"thread":{"id":26377}}}
{"@timestamp":"2023-12-15T13:45:00.560412691Z","agent":{"id":"13c4c478-4006-46dc-8cef-28489a279fc2","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":728,"name":"Artifacts.cpp"}}},"message":"Artifacts.cpp:728 Failed to initialize artifact, identifier: endpoint-hostisolationexceptionlist-linux-v1, reason: HTTP non-200 code received","process":{"pid":26341,"thread":{"id":26377}}}
{"@timestamp":"2023-12-15T13:45:00.560423771Z","agent":{"id":"13c4c478-4006-46dc-8cef-28489a279fc2","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":1535,"name":"Artifacts.cpp"}}},"message":"Artifacts.cpp:1535 All artifacts are being rejected because endpoint-hostisolationexceptionlist-linux-v1 is invalid","process":{"pid":26341,"thread":{"id":26377}}}
{"@timestamp":"2023-12-15T13:45:00.560434131Z","agent":{"id":"13c4c478-4006-46dc-8cef-28489a279fc2","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":1564,"name":"Artifacts.cpp"}}},"message":"Artifacts.cpp:1564 Failed to process artifact manifest","process":{"pid":26341,"thread":{"id":26377}}}

Right before those errors, there are these two logs about proxy URLs; not sure if those are relevant to the errors or not:

{"@timestamp":"2023-12-15T13:45:00.384293332Z","agent":{"id":"13c4c478-4006-46dc-8cef-28489a279fc2","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":179,"name":"Proxy.cpp"}}},"message":"Proxy.cpp:179  Global manifest override proxy URL: not set","process":{"pid":26341,"thread":{"id":26377}}}
{"@timestamp":"2023-12-15T13:45:00.384301412Z","agent":{"id":"13c4c478-4006-46dc-8cef-28489a279fc2","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"info","origin":{"file":{"line":179,"name":"Proxy.cpp"}}},"message":"Proxy.cpp:179  User manifest override proxy URL: not set","process":{"pid":26341,"thread":{"id":26377}}}

Failures don't seem related to the changes in this PR.

@jlind23
Contributor

jlind23 commented Dec 19, 2023

buildkite test this

Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
55.6% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

@ycombinator ycombinator merged commit ad7e1b5 into elastic:main Dec 19, 2023
@ycombinator ycombinator deleted the upgrade-details-fix-upg-watching branch December 19, 2023 14:22
mergify bot pushed a commit that referenced this pull request Dec 19, 2023
… time that the upgrade is being watched (#3827)

* Don't set upgrade details when creating upgrade marker

* Set UPG_WATCHING state right before starting to watch upgrade

* Log upgrade details whenever they're set on the coordinator

* Fix logging location

* Revert "Don't set upgrade details when creating upgrade marker"

This reverts commit 6821832.

* Fix logic with assuming UPG_ROLLBACK state

* Add FIXME

* Correctly observe upgrade details changes

* Update unit test

* Include upgrade details in status output

* Check upgrade details state before and after upgrade watcher starts

* Check that upgrade details have been cleared out upon successful upgrade

* Update unit test

* Fixing up upgrade integration tests

* Add unit test + fix details object being used

* Define AgentStatusOutput.IsZero() and use it

* Make sure Marker Watcher accounts for `UPG_COMPLETED` state

* Fix location of assertion

* Fix error message

* Join errors for wrapping

* Debugging why TestStandaloneDowngradeToSpecificSnapshotBuild is failing

* Cast string to details.State

* Remove version override debugging

* Wrap bugfix assertions in version checks

* Introduce upgradetest.WithDisableUpgradeWatcherUpgradeDetailsCheck option

* Call option function

* Debugging

* Fixing version check logic

* Remove debugging statements

(cherry picked from commit ad7e1b5)
ycombinator added a commit that referenced this pull request Dec 19, 2023
… time that the upgrade is being watched (#3827) (#3927)

(cherry picked from commit ad7e1b5)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
cmacknz pushed a commit that referenced this pull request Jan 17, 2024
… time that the upgrade is being watched (#3827) (#3927)

(cherry picked from commit ad7e1b5)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
Labels
backport-v8.12.0 (Automated backport with mergify), skip-changelog, Team:Elastic-Agent (Label for the Agent team)