Agent .deb install: state.enc not copied during Elastic Agent upgrade from 8.13 and above #5101

ceeeekay · 2024-07-10T02:06:30Z

Version: 8.13 and above
Operating System: Ubuntu 22.04
Install Method: Apt
Discuss Forum URL: Discuss Forum Post

Description
When upgrading Elastic Agent from version 8.13 or above (e.g., 8.14.1 to 8.14.2), the state.enc file is not copied to the new install directory. This leaves the agent stuck in the state "(STARTING) Waiting for initial configuration and composable variables".

Steps to Reproduce

Install Elastic Agent version 8.13 or above (e.g., 8.14.1).
Enroll the agent in a Fleet policy.
Verify the agent is running and healthy.
Upgrade to a higher version (e.g., 8.14.2).
Observe that state.enc is not copied to the new install directory, causing the agent to fail to start correctly.

Observations Post Upgrade

# elastic-agent version
Binary: 8.14.2 (build: 1738179d53e747c48af7350a0b8fe68eda1a5b31 at 2024-07-01 16:29:34 +0000 UTC)
Daemon: 8.14.2 (build: 1738179d53e747c48af7350a0b8fe68eda1a5b31 at 2024-07-01 16:29:34 +0000 UTC)

# ls -al /var/lib/elastic-agent/data/elastic-agent-*
elastic-agent-8.14.1-1348b9:
total 24
drwxr-xr-x 4 root root 4096 Jul 10 11:26 .
drwxr-xr-x 7 root root 4096 Jul 10 11:26 ..
drwx------ 2 root root 4096 Jul 10 11:26 logs
drwxr-x--- 4 root root 4096 Jul 10 11:25 run
-rw------- 1 root root 6745 Jul 10 11:26 state.enc

elastic-agent-8.14.2-173817:
total 111876
drwxr-xr-x 4 root root      4096 Jul 10 11:29 .
drwxr-xr-x 7 root root      4096 Jul 10 11:26 ..
drwxr-xr-x 5 root root      4096 Jul 10 11:26 components
-rwxr-xr-x 1 root root 114530504 Jul  2 18:54 elastic-agent
drwx------ 2 root root      4096 Jul 10 11:29 logs
-rw-r--r-- 1 root root       339 Jul  2 18:55 manifest.yaml
-rw-r--r-- 1 root root         7 Jul  2 18:55 package.version
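
The same check can be scripted. The following is a minimal sketch, not part of the Elastic Agent package: `report_state_enc` is a hypothetical helper name, and it only assumes the `elastic-agent-*` directory layout shown in the listing above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Elastic Agent): for every
# elastic-agent-* version directory under the given data directory,
# print whether it contains a state.enc file.
report_state_enc() {
  data_dir="$1"
  for dir in "$data_dir"/elastic-agent-*; do
    # Skip anything that is not a directory (including an unmatched glob).
    [ -d "$dir" ] || continue
    if [ -f "$dir/state.enc" ]; then
      echo "$(basename "$dir"): state.enc present"
    else
      echo "$(basename "$dir"): state.enc MISSING"
    fi
  done
}
```

On the system above it would be called as `report_state_enc /var/lib/elastic-agent/data` (as root, since the directories are not world-readable).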

Issue
The postinst script fails to correctly follow the symlink to the old agent directory, leading to the failure to copy state.enc. This leaves the agent in a broken state where it believes it is enrolled in Fleet but lacks the necessary state information.

Additional Notes
Upgrading from versions prior to 8.13 (e.g., from 8.12.2 to 8.14.2) successfully copies the state.enc file, indicating the issue is likely related to changes introduced in version 8.13.


leehinman · 2024-07-11T14:18:12Z

Likely issue is at

elastic-agent/dev-tools/packaging/templates/linux/postrm.sh.tmpl

Lines 12 to 26 in ca5a07c

    
      # 0 is for rpm uninstall
      upgrade|remove|failed-upgrade|abort-install|abort-upgrade|disappear|0)
        if systemctl --quiet is-active elastic-agent; then
          echo "stopping elastic-agent"
          systemctl --quiet stop elastic-agent
        fi
        # delete symlink if exists
        if test -L "$symlink"; then
          echo "found symlink $symlink, unlink"
          unlink "$symlink"
        fi
        ;;
      *)
        ;;
    esac

A short-term fix would be to not remove the symlink during an upgrade. Long term, I think we need to seriously evaluate whether a symlink left on the system after a package has been removed is the best way to determine where configuration information is in a package manager environment.

leehinman · 2024-07-11T18:36:14Z

Confirmed the following change is sufficient for the upgrade to work.

diff --git a/dev-tools/packaging/templates/linux/postrm.sh.tmpl b/dev-tools/packaging/templates/linux/postrm.sh.tmpl
index 9fb1e730c0..3cfe0cdd2c 100644
--- a/dev-tools/packaging/templates/linux/postrm.sh.tmpl
+++ b/dev-tools/packaging/templates/linux/postrm.sh.tmpl
@@ -10,7 +10,7 @@ case "$1" in
     ;;

   # 0 is for rpm uninstall
-  upgrade|remove|failed-upgrade|abort-install|abort-upgrade|disappear|0)
+  remove|disappear)
     if systemctl --quiet is-active elastic-agent; then
       echo "stopping elastic-agent"
       systemctl --quiet stop elastic-agent
@@ -21,6 +21,12 @@ case "$1" in
       unlink "$symlink"
     fi
     ;;
+  upgrade|failed-upgrade|abort-install|abort-upgrade|0)
+    if systemctl --quiet is-active elastic-agent; then
+      echo "stopping elastic-agent"
+      systemctl --quiet stop elastic-agent
+    fi
+    ;;
   *)
     ;;
 esac

ceeeekay · 2024-07-11T20:05:14Z

Hi @leehinman,

Confirmed - works for me, thanks :)

ceeeekay · 2024-07-11T22:43:25Z

@leehinman - An unrelated question: is there any reason the agent is shut down after an upgrade? We have our automation restart the service after an upgrade so it's not a major issue for us, but I'd expect it to be restarted rather than stopped.

blakerouse · 2024-07-30T16:56:58Z

Confirmed that this is still an issue on 8.16.0. I installed 8.14.3, enrolled it into Fleet, and then upgraded to 8.16.0. I had to manually restart the elastic-agent service, which also doesn't make sense, and then it is stuck here:

┌─ fleet
│  └─ status: (STARTING) 
└─ elastic-agent
   └─ status: (STARTING) Waiting for initial configuration and composable variables

blakerouse · 2024-08-05T14:15:37Z

I have identified the issue and determined that upgrading the DEB package (I haven't done any testing of the RPM yet) has never worked properly if the Elastic Agent is enrolled into Fleet.

The order in which Debian runs the maintainer scripts is not the order that the Elastic Agent maintainer scripts were written for. The current implementation assumes that the postinstall script will run before the older version of the package is removed, but that is not the case. Because of this, postinstall has nothing to copy, as the state files have already been removed.
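
To make the ordering concrete, here is a small simulation that runs against a scratch directory only. It is a sketch under stated assumptions: the step order follows the Debian Policy description of an upgrade (old prerm, new preinst, unpack, old postrm, new postinst), while the directory names, the `agent-link` symlink, and the copy logic in step 5 are hypothetical stand-ins for what the thread describes, not the real package scripts.

```shell
#!/bin/sh
# Simulation only: replays the dpkg upgrade ordering in a scratch directory.
# All paths and the postinst copy logic are illustrative assumptions.
simulate_upgrade() {
  root="$1"
  keep_link="${2:-no}"              # "yes" models an old postrm that keeps the symlink
  old="$root/data/elastic-agent-old"
  new="$root/data/elastic-agent-new"
  link="$root/agent-link"           # hypothetical symlink to the active version

  # Starting point: the old version is installed and enrolled.
  mkdir -p "$old"
  echo "enrolled" > "$old/state.enc"
  ln -s "$old" "$link"

  echo "1. old prerm upgrade"
  echo "2. new preinst upgrade"
  mkdir -p "$new"
  echo "3. unpack new files"
  # The OLD package's postrm runs here, before the new postinst.
  if [ "$keep_link" = "yes" ]; then
    echo "4. old postrm upgrade (symlink kept)"
  else
    unlink "$link"
    echo "4. old postrm upgrade (symlink removed)"
  fi
  # Assumed postinst behaviour: follow the symlink to find state.enc.
  if [ -L "$link" ] && [ -f "$link/state.enc" ]; then
    cp "$link/state.enc" "$new/state.enc"
    echo "5. new postinst configure: state.enc copied"
  else
    echo "5. new postinst configure: nothing to copy"
  fi
}
```

Running `simulate_upgrade "$(mktemp -d)"` ends with "nothing to copy", because step 4 has already removed the only pointer to the old directory. Passing `yes` as the second argument models an old package whose postrm leaves the symlink in place on upgrade, in which case step 5 copies the file.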

It is possible to fix the issue in 8.16+ and then all upgrades from 8.16+ to a later version work correctly. This is not a great way of fixing this issue, because it causes already installed pre-8.16 versions to not be upgradable.

I am working on a solution where a preinstall script is used, which is executed before the new version of the package is extracted and before the old version is removed. It will copy the state files into the new installation directory to ensure that they are not removed before they can be carried over. I have this working, but I am still seeing an issue where the Elastic Agent is still failing to reconnect to Fleet. It is using the saved policy from state.enc to ship data.
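
A rough sketch of what such a preinstall step could look like is below. This is not the change merged in #5260: `copy_state` is a hypothetical function, the old and new directories are passed in explicitly rather than discovered, and only state.enc is handled because it is the only state file named in this thread. The one outside fact relied on is that dpkg calls preinst with `upgrade` as its first argument when replacing an installed version.

```shell
#!/bin/sh
# Hypothetical preinst-style helper; an illustration, not the merged fix.
# Arguments: <action passed by dpkg> <old version dir> <new version dir>
copy_state() {
  action="$1"
  old_dir="$2"
  new_dir="$3"
  case "$action" in
    upgrade)
      if [ -f "$old_dir/state.enc" ]; then
        # Create the target early; the package contents are unpacked later.
        mkdir -p "$new_dir"
        # -p preserves mode and timestamps of the original file.
        cp -p "$old_dir/state.enc" "$new_dir/state.enc"
        echo "copied state.enc"
      else
        echo "no state.enc found, nothing to copy"
      fi
      ;;
    *)
      # Fresh installs and other actions have no previous state to carry over.
      echo "not an upgrade, skipping"
      ;;
  esac
}
```

Doing the copy this early means the file is already in place regardless of what the old package's postrm removes afterwards.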

ceeeekay added the bug Something isn't working label Jul 10, 2024

ycombinator assigned blakerouse Jul 10, 2024

blakerouse mentioned this issue Aug 6, 2024

Fix debian packaging for upgrades #5260

Merged

4 tasks

ycombinator closed this as completed in #5260 Aug 13, 2024

mergify bot mentioned this issue Aug 13, 2024

[8.15](backport #5260) Fix debian packaging for upgrades #5291

Merged

4 tasks
