Agent .deb install: state.enc not copied during Elastic Agent upgrade from 8.13 and above #5101

Closed
ceeeekay opened this issue Jul 10, 2024 · 6 comments · Fixed by #5260

ceeeekay commented Jul 10, 2024

  • Version: 8.13 and above
  • Operating System: Ubuntu 22.04
    • Install Method: Apt
  • Discuss Forum URL: Discuss Forum Post

Description
When upgrading Elastic Agent from version 8.13 or above (e.g., 8.14.1 to 8.14.2), the state.enc file is not copied to the new install directory. This causes the agent to remain in a (STARTING) Waiting for initial configuration and composable variables state.

Steps to Reproduce

  • Install Elastic Agent version 8.13 or above (e.g., 8.14.1).
  • Enroll the agent in a Fleet policy.
  • Verify the agent is running and healthy.
  • Upgrade to a higher version (e.g., 8.14.2).
  • Observe that state.enc is not copied to the new install directory, causing the agent to fail to start correctly.

Observations Post Upgrade

# elastic-agent version
Binary: 8.14.2 (build: 1738179d53e747c48af7350a0b8fe68eda1a5b31 at 2024-07-01 16:29:34 +0000 UTC)
Daemon: 8.14.2 (build: 1738179d53e747c48af7350a0b8fe68eda1a5b31 at 2024-07-01 16:29:34 +0000 UTC)

# ls -al /var/lib/elastic-agent/data/elastic-agent-*
elastic-agent-8.14.1-1348b9:
total 24
drwxr-xr-x 4 root root 4096 Jul 10 11:26 .
drwxr-xr-x 7 root root 4096 Jul 10 11:26 ..
drwx------ 2 root root 4096 Jul 10 11:26 logs
drwxr-x--- 4 root root 4096 Jul 10 11:25 run
-rw------- 1 root root 6745 Jul 10 11:26 state.enc

elastic-agent-8.14.2-173817:
total 111876
drwxr-xr-x 4 root root      4096 Jul 10 11:29 .
drwxr-xr-x 7 root root      4096 Jul 10 11:26 ..
drwxr-xr-x 5 root root      4096 Jul 10 11:26 components
-rwxr-xr-x 1 root root 114530504 Jul  2 18:54 elastic-agent
drwx------ 2 root root      4096 Jul 10 11:29 logs
-rw-r--r-- 1 root root       339 Jul  2 18:55 manifest.yaml
-rw-r--r-- 1 root root         7 Jul  2 18:55 package.version

Issue
The postinst script fails to correctly follow the symlink to the old agent directory, leading to the failure to copy state.enc. This leaves the agent in a broken state where it believes it is enrolled in Fleet but lacks the necessary state information.

Additional Notes
Upgrading from versions prior to 8.13 (e.g., from 8.12.2 to 8.14.2) successfully copies the state.enc file, indicating the issue is likely related to changes introduced in version 8.13.

@ceeeekay ceeeekay added the bug Something isn't working label Jul 10, 2024
@leehinman
Contributor

Likely issue is at

  # 0 is for rpm uninstall
  upgrade|remove|failed-upgrade|abort-install|abort-upgrade|disappear|0)
    if systemctl --quiet is-active elastic-agent; then
      echo "stopping elastic-agent"
      systemctl --quiet stop elastic-agent
    fi
    # delete symlink if exists
    if test -L "$symlink"; then
      echo "found symlink $symlink, unlink"
      unlink "$symlink"
    fi
    ;;
  *)
    ;;
esac

A short-term fix would be to not remove the symlink during an upgrade. Long term, I think we need to seriously evaluate whether a symlink left on the system after a package has been removed is the best way to determine where configuration information lives in a package manager environment.
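
To illustrate why unlinking matters, here is a minimal sketch of how a later script might locate the old install directory through the symlink. The function name `resolve_old_install` is hypothetical, and this is not the actual postinst logic; it only shows that once the symlink has been removed there is nothing left to follow.

```shell
#!/bin/sh
# Hypothetical helper (not from the agent's packaging scripts): print the
# directory a symlink points to, or print nothing if the symlink is gone.
resolve_old_install() {
  link="$1"
  if [ -L "$link" ]; then
    readlink "$link"
  fi
}
```

If postrm has already run `unlink "$symlink"`, a call such as `resolve_old_install "$symlink"` prints nothing, so a script that depends on it has no old directory to copy `state.enc` from.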

@leehinman
Contributor

Confirmed the following change is sufficient for the upgrade to work.

diff --git a/dev-tools/packaging/templates/linux/postrm.sh.tmpl b/dev-tools/packaging/templates/linux/postrm.sh.tmpl
index 9fb1e730c0..3cfe0cdd2c 100644
--- a/dev-tools/packaging/templates/linux/postrm.sh.tmpl
+++ b/dev-tools/packaging/templates/linux/postrm.sh.tmpl
@@ -10,7 +10,7 @@ case "$1" in
     ;;

   # 0 is for rpm uninstall
-  upgrade|remove|failed-upgrade|abort-install|abort-upgrade|disappear|0)
+  remove|disappear)
     if systemctl --quiet is-active elastic-agent; then
       echo "stopping elastic-agent"
       systemctl --quiet stop elastic-agent
@@ -21,6 +21,12 @@ case "$1" in
       unlink "$symlink"
     fi
     ;;
+  upgrade|failed-upgrade|abort-install|abort-upgrade|0)
+    if systemctl --quiet is-active elastic-agent; then
+      echo "stopping elastic-agent"
+      systemctl --quiet stop elastic-agent
+    fi
+    ;;
   *)
     ;;
 esac

@ceeeekay
Author

Hi @leehinman,

Confirmed - works for me, thanks :)

@ceeeekay
Author

@leehinman - An unrelated question: is there any reason the agent is shut down after an upgrade? We have our automation restart the service after an upgrade so it's not a major issue for us, but I'd expect it to be restarted rather than stopped.

@blakerouse
Contributor

Confirmed that this is still an issue on 8.16.0. I installed 8.14.3, enrolled into Fleet, and then upgraded to 8.16.0. I had to manually restart the elastic-agent service, which also doesn't make sense, and then it is stuck here:

┌─ fleet
│  └─ status: (STARTING) 
└─ elastic-agent
   └─ status: (STARTING) Waiting for initial configuration and composable variables

@blakerouse
Contributor

I have identified the issue and determined that upgrading the DEB package (I haven't done any testing of the RPM yet) has never worked properly if the Elastic Agent is enrolled in Fleet.

The order in which Debian runs the maintainer scripts is not the order that the Elastic Agent maintainer scripts were written to expect. The current implementation assumes that the postinstall script will run before the older version of the package is removed, but that is not the case. Because of this, postinstall has nothing to copy, as the state files have already been removed.
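
For reference, the ordering described above can be written out as a small sketch. It follows the upgrade sequence documented in Debian Policy, simplified to the success path (error-unwind calls are omitted); the function name `deb_upgrade_sequence` is hypothetical and the script only prints the order, it does not run anything.

```shell
#!/bin/sh
# Simplified success-path order of maintainer script calls during a dpkg
# upgrade, as documented in Debian Policy. Error-unwind calls are omitted.
# "old" and "new" are the installed and incoming package versions.
deb_upgrade_sequence() {
  old="$1"
  new="$2"
  echo "1. old prerm upgrade $new"
  echo "2. new preinst upgrade $old"
  echo "3. unpack files from new package"
  echo "4. old postrm upgrade $new"
  echo "5. remove old package files not present in the new package"
  echo "6. new postinst configure $old"
}
```

The point to note is that the old package's postrm (step 4) runs before the new package's postinst (step 6), so anything postrm removes is no longer available when postinst runs.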

It is possible to fix the issue in 8.16+ and then all upgrades from 8.16+ to a later version work correctly. This is not a great way of fixing this issue, because it causes already installed pre-8.16 versions to not be upgradable.

I am working on a solution where a preinstall script is used, which is executed before the extraction of the new version of the package and before the old version is removed. It will copy the state files into the new installation directory so that they are preserved before the old version's files are removed. I have this working, but I am still seeing an issue where the Elastic Agent fails to re-connect to Fleet. It is using the saved policy from the state.enc to ship data.
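
A minimal sketch of the copy step such a preinstall script might perform is below. This is an illustration under assumptions, not the implementation from the eventual fix: the function name `copy_state` is hypothetical, the directories are passed in by the caller, and only `state.enc` is handled because it is the only state file named in this issue.

```shell
#!/bin/sh
# Hypothetical preinst helper: copy state.enc from the old version directory
# into the new one before the old package's files are removed.
# Does nothing if the old directory has no state.enc (e.g. a fresh install).
copy_state() {
  old_dir="$1"
  new_dir="$2"
  [ -f "$old_dir/state.enc" ] || return 0
  mkdir -p "$new_dir"
  # -p preserves the file mode (0600 in the listing above) and timestamps.
  cp -p "$old_dir/state.enc" "$new_dir/state.enc"
}
```

In a real preinst this would be called only when the first argument is `upgrade`, with the old and new version directories resolved by the script itself.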
