
[Flaky Test] TestFleetManagedUpgrade fails waiting for the watcher to start #3760

Closed
AndersonQ opened this issue Nov 13, 2023 · 8 comments
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team

Comments

@AndersonQ
Member

AndersonQ commented Nov 13, 2023

Flaky Test

Stack Trace

=== RUN   TestFleetManagedUpgrade
    fetcher.go:91: Using existing artifact elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz
    fixture.go:261: Extracting artifact elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestFleetManagedUpgrade808495215/001
    fixture.go:274: Completed extraction of artifact elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestFleetManagedUpgrade808495215/001
    fixture.go:709: Components were not modified from the fetched artifact
    fixture.go:512: >> running binary with: [/tmp/TestFleetManagedUpgrade808495215/001/elastic-agent-8.12.0-SNAPSHOT-linux-arm64/elastic-agent version --binary-only --yaml]
    fetcher_artifact.go:213: Downloading artifact from https://snapshots.elastic.co/8.12.0-9a986274/downloads/beats/elastic-agent/elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz
    fetcher_artifact.go:292: Downloading artifact progress 2.55%
    fetcher_artifact.go:292: Downloading artifact progress 3.65%
    fetcher_artifact.go:292: Downloading artifact progress 5.47%
    fetcher_artifact.go:292: Downloading artifact progress 8.61%
    fetcher_artifact.go:292: Downloading artifact progress 11.13%
    fetcher_artifact.go:292: Downloading artifact progress 13.82%
    fetcher_artifact.go:292: Downloading artifact progress 16.69%
    fetcher_artifact.go:292: Downloading artifact progress 19.11%
    fetcher_artifact.go:292: Downloading artifact progress 21.46%
    fetcher_artifact.go:292: Downloading artifact progress 24.54%
    fetcher_artifact.go:292: Downloading artifact progress 27.73%
    fetcher_artifact.go:292: Downloading artifact progress 30.33%
    fetcher_artifact.go:292: Downloading artifact progress 32.46%
    fetcher_artifact.go:292: Downloading artifact progress 34.98%
    fetcher_artifact.go:292: Downloading artifact progress 37.69%
    fetcher_artifact.go:292: Downloading artifact progress 40.46%
    fetcher_artifact.go:292: Downloading artifact progress 43.25%
    fetcher_artifact.go:292: Downloading artifact progress 46.17%
    fetcher_artifact.go:292: Downloading artifact progress 49.29%
    fetcher_artifact.go:292: Downloading artifact progress 53.13%
    fetcher_artifact.go:292: Downloading artifact progress 56.31%
    fetcher_artifact.go:292: Downloading artifact progress 59.49%
    fetcher_artifact.go:292: Downloading artifact progress 62.01%
    fetcher_artifact.go:292: Downloading artifact progress 65.06%
    fetcher_artifact.go:292: Downloading artifact progress 68.37%
    fetcher_artifact.go:292: Downloading artifact progress 70.62%
    fetcher_artifact.go:292: Downloading artifact progress 73.80%
    fetcher_artifact.go:292: Downloading artifact progress 76.98%
    fetcher_artifact.go:292: Downloading artifact progress 80.17%
    fetcher_artifact.go:292: Downloading artifact progress 82.68%
    fetcher_artifact.go:292: Downloading artifact progress 85.73%
    fetcher_artifact.go:292: Downloading artifact progress 88.10%
    fetcher_artifact.go:292: Downloading artifact progress 90.63%
    fetcher_artifact.go:292: Downloading artifact progress 94.45%
    fetcher_artifact.go:292: Downloading artifact progress 97.50%
    fetcher_artifact.go:259: Completed downloading artifact from https://snapshots.elastic.co/8.12.0-9a986274/downloads/beats/elastic-agent/elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz
    fetcher_artifact.go:213: Downloading artifact from https://snapshots.elastic.co/8.12.0-9a986274/downloads/beats/elastic-agent/elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz.sha512
    fetcher_artifact.go:259: Completed downloading artifact from https://snapshots.elastic.co/8.12.0-9a986274/downloads/beats/elastic-agent/elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz.sha512
    fetcher_artifact.go:213: Downloading artifact from https://snapshots.elastic.co/8.12.0-9a986274/downloads/beats/elastic-agent/elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz.asc
    fetcher_artifact.go:259: Completed downloading artifact from https://snapshots.elastic.co/8.12.0-9a986274/downloads/beats/elastic-agent/elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz.asc
    fixture.go:261: Extracting artifact elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestFleetManagedUpgrade808495215/002
    fixture.go:274: Completed extraction of artifact elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestFleetManagedUpgrade808495215/002
    fixture.go:709: Components were not modified from the fetched artifact
    fixture.go:512: >> running binary with: [/tmp/TestFleetManagedUpgrade808495215/002/elastic-agent-8.12.0-SNAPSHOT-linux-arm64/elastic-agent version --binary-only --yaml]
    upgrade_fleet_test.go:83: Testing Elastic Agent upgrade from 8.12.0-SNAPSHOT to 8.12.0-SNAPSHOT with Fleet...
    fixture.go:512: >> running binary with: [/tmp/TestFleetManagedUpgrade808495215/001/elastic-agent-8.12.0-SNAPSHOT-linux-arm64/elastic-agent version --binary-only --yaml]
    fixture.go:512: >> running binary with: [/tmp/TestFleetManagedUpgrade808495215/002/elastic-agent-8.12.0-SNAPSHOT-linux-arm64/elastic-agent version --binary-only --yaml]
    upgrade_fleet_test.go:181: Creating Agent policy...
    upgrade_fleet_test.go:193: Creating Agent enrollment API key...
    upgrade_fleet_test.go:200: Getting default Fleet Server URL...
    upgrade_fleet_test.go:204: Installing Elastic Agent...
    fixture_install.go:99: [test TestFleetManagedUpgrade] Inside fixture install function
    fixture.go:512: >> running binary with: [/tmp/TestFleetManagedUpgrade808495215/001/elastic-agent-8.12.0-SNAPSHOT-linux-arm64/elastic-agent install --force --non-interactive --url https://a0e64f2a942d4d8a851b8130ad434279.fleet.us-east-1.aws.staging.foundit.no:443 --enrollment-token TFhMR3lJc0JyQW1LU2dnNkFGRng6VmN0N1NqMTVROEc1bVBsZlREZUZCUQ==]
    upgrade_fleet_test.go:220: Waiting for Agent to be correct version and healthy...
    fixture.go:512: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    upgrade_fleet_test.go:224: Waiting for enrolled Agent status to be online...
    upgrade_fleet_test.go:232: Upgrading from version "8.12.0-SNAPSHOT-805a9c8419f1a60cc81c741e013e660d6e975757" to version "8.12.0-SNAPSHOT-3155ea9db05bc1ceacbbccc9d356f5922fa4cf87"...
    upgrade_fleet_test.go:239: Waiting for upgrade watcher to start...
    upgrade_fleet_test.go:241: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:241
        	            				/home/ubuntu/agent/testing/integration/upgrade_fleet_test.go:85
        	Error:      	Received unexpected error:
        	            	context deadline exceeded
        	Test:       	TestFleetManagedUpgrade
        	Messages:   	upgrade watcher did not start
    fixture_install.go:127: [test TestFleetManagedUpgrade] Inside fixture cleanup function
    fixture_install.go:142: collecting diagnostics; test failed
    fixture.go:512: >> running binary with: [/opt/Elastic/Agent/elastic-agent diagnostics -f /home/ubuntu/agent/build/diagnostics/TestFleetManagedUpgrade-diagnostics-2023-11-13T13:07:20Z.zip]
    fixture.go:512: >> running binary with: [/opt/Elastic/Agent/elastic-agent uninstall --force]
--- FAIL: TestFleetManagedUpgrade (409.91s)
AndersonQ added the Team:Elastic-Agent and flaky-test labels Nov 13, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz
Member

cmacknz commented Nov 14, 2023

Looking at the upgrade details in the state.yaml file, we can see the upgrade is stuck in the UPG_DOWNLOADING stage:

upgrade_details:
    action_id: ace1e80d-5a9f-4156-ae40-84db7763c6bf
    metadata: {}
    state: UPG_DOWNLOADING
    target_version: 8.12.0-SNAPSHOT

It looks like the upgrade watcher isn't starting because the download is failing, so we never get to the point where the watcher is launched:

{"log.level":"info","@timestamp":"2023-11-13T20:49:52.133Z","log.origin":{"file.name":"upgrade/step_download.go","file.line":250},"message":"download attempt 7","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-11-13T20:49:52.134Z","log.origin":{"file.name":"upgrade/step_download.go","file.line":260},"message":"unable to download package: 3 errors occurred:\n\t* package '/opt/Elastic/Agent/data/elastic-agent-6645eb/downloads/elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz' not found: open /opt/Elastic/Agent/data/elastic-agent-6645eb/downloads/elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz: no such file or directory\n\t* fetching package failed: Get \"http://127.0.0.1:35295/downloads/beats/elastic-agent/beats/elastic-agent/elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz\": dial tcp 127.0.0.1:35295: connect: connection refused\n\t* fetching package failed: Get \"http://127.0.0.1:35295/downloads/beats/elastic-agent/beats/elastic-agent/elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz\": dial tcp 127.0.0.1:35295: connect: connection refused\n\n; retrying (will be retry 7) in 1m25.185118863s.","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

From what I can see, it is trying to download the artifact from localhost:

127.0.0.1:35295

This is exactly what it was configured to do, judging by the pre-config.yaml:

agent:
    download:
        sourceURI: http://127.0.0.1:35295/downloads/beats/elastic-agent/

The test itself could probably start watching the upgrade details on main instead of looking for the watcher to start; that would give us a much better error message. This block could instead use the status command to poll for the UPG_WATCHING state (see the sketch after the snippet below):

// wait for the watcher to show up
t.Logf("Waiting for upgrade watcher to start...")
err = upgradetest.WaitForWatcher(ctx, 5*time.Minute, 10*time.Second)
require.NoError(t, err)
t.Logf("Upgrade watcher started")

@cmacknz
Member

cmacknz commented Nov 14, 2023

I also suspect there is a bug in the upgrade details here; the download failing should have put us in the upgrade failed state. Maybe we just needed to wait longer for that to happen, though; I'll open a separate issue if I can confirm that is a problem.

@ycombinator
Contributor

ycombinator commented Nov 14, 2023

I also suspect there is a bug in the upgrade details here; the download failing should have put us in the upgrade failed state. Maybe we just needed to wait longer for that to happen, though; I'll open a separate issue if I can confirm that is a problem.

We retry the download step indefinitely. As such, we never enter the UPG_FAILED state.

But perhaps we should enter the UPG_FAILED state right after a download attempt fails and then reset to the UPG_DOWNLOADING state right before we attempt a download again? I've put up a PR implementing this change: #3769.
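
A minimal, self-contained sketch of that idea; Details, set, and the download callback are hypothetical stand-ins, not the agent's real types:

package main

import (
	"context"
	"fmt"
	"time"
)

// Details is a stand-in for the persisted upgrade details.
type Details struct {
	State string
	Err   error
}

func (d *Details) set(state string, err error) {
	d.State, d.Err = state, err
	fmt.Printf("upgrade_details: state=%s err=%v\n", state, err)
}

// downloadWithRetry flips the details to UPG_FAILED after each unsuccessful
// attempt and resets them to UPG_DOWNLOADING right before the next one,
// which is the behavior proposed in #3769.
func downloadWithRetry(ctx context.Context, d *Details, download func(context.Context) error, interval time.Duration) error {
	for {
		d.set("UPG_DOWNLOADING", nil) // reset right before each attempt
		err := download(ctx)
		if err == nil {
			return nil
		}
		d.set("UPG_FAILED", err) // surface the failure between attempts
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	attempts := 0
	_ = downloadWithRetry(ctx, &Details{}, func(context.Context) error {
		attempts++
		return fmt.Errorf("connection refused (attempt %d)", attempts)
	}, time.Second)
}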

@cmacknz
Member

cmacknz commented Nov 15, 2023

We retry the download step indefinitely. As such, we never enter the UPG_FAILED state.

Yeah, the default timeout is 2 hours, so I think we would have gotten to UPG_FAILED eventually.

// Elastic Agent binary is rather large and based on the network bandwidth it could take some time
// to download the full file. 120 minutes is a very large value, but we really want it to finish.
// The HTTP download will log progress in the case that it is taking a while to download.
transport.Timeout = 120 * time.Minute

cancelCtx, cancel := context.WithTimeout(ctx, settings.Timeout)
defer cancel()
expBo := backoff.NewExponentialBackOff()
expBo.InitialInterval = settings.RetrySleepInitDuration
boCtx := backoff.WithContext(expBo, cancelCtx)
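
For context, consuming that boCtx would look roughly like the toy loop below, assuming the quoted snippet uses github.com/cenkalti/backoff/v4. The tiny intervals and the always-failing operation are just for demonstration; this is not the agent's actual download loop:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	// Stand-in for settings.Timeout; the real value is 120 minutes.
	cancelCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	expBo := backoff.NewExponentialBackOff()
	expBo.InitialInterval = 500 * time.Millisecond // stand-in for settings.RetrySleepInitDuration
	boCtx := backoff.WithContext(expBo, cancelCtx)

	// Retry sleeps exponentially longer between attempts and stops once the
	// context deadline passes.
	attempt := 0
	err := backoff.Retry(func() error {
		attempt++
		fmt.Printf("download attempt %d\n", attempt)
		return errors.New("dial tcp 127.0.0.1:35295: connect: connection refused")
	}, boCtx)
	fmt.Printf("gave up: %v\n", err)
}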

I think in real life this is fine. We should keep UPG_FAILED as a terminal state that means the upgrade won't succeed without being retried by the user; resetting from UPG_FAILED to UPG_DOWNLOADING would make this inconsistent.

The user should see the UPG_DOWNLOADING error even if it isn't in the UPG_FAILED state.

I do think the UPG_DOWNLOADING state needs a place to store an error message from the last unsuccessful retry, though; otherwise users won't see the actual progress.

Just looking at the following wasn't enough to determine the error; I had to look at the logs. I assume it will be similar in the UI unless something is missing from the state.yaml specifically:

upgrade_details:
    action_id: ace1e80d-5a9f-4156-ae40-84db7763c6bf
    metadata: {}
    state: UPG_DOWNLOADING
    target_version: 8.12.0-SNAPSHOT

@ycombinator
Contributor

ycombinator commented Nov 15, 2023

I think in real life this is fine. We should keep UPG_FAILED as a terminal state that means the upgrade won't succeed without being retried by the user; resetting from UPG_FAILED to UPG_DOWNLOADING would make this inconsistent.

Agreed. I have closed #3769 unmerged.

I do think the UPG_DOWNLOADING state needs a place to store an error message from the last unsuccessful retry, though; otherwise users won't see the actual progress.

We could add a metadata.transient_error field or a metadata.retryable_error field. The former has a broader connotation and could be used for other transient errors in the future. The latter is a bit more specific but might also be clearer as to when it would be populated. Thoughts?
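
For illustration, with a hypothetical metadata.retryable_error field, the state.yaml shown earlier might look like this (the message is taken from the log excerpt above):

upgrade_details:
    action_id: ace1e80d-5a9f-4156-ae40-84db7763c6bf
    metadata:
        retryable_error: 'fetching package failed: Get "http://127.0.0.1:35295/downloads/beats/elastic-agent/beats/elastic-agent/elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz": dial tcp 127.0.0.1:35295: connect: connection refused'
    state: UPG_DOWNLOADING
    target_version: 8.12.0-SNAPSHOT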

@AndersonQ
Member Author

Closed; it was caused by #3724.

@cmacknz
Member

cmacknz commented Nov 15, 2023

We could add a metadata.transient_error field or a metadata.retryable_error field. The former has a broader connotation and could be used for other transient errors in the future. The latter is a bit more specific but might also be clearer as to when it would be populated. Thoughts?

I like retryable_error since it makes what we are doing clearer, although sometimes the error might not be retryable and we retry it anyway.

Alternatively, we could just use error and include the time remaining before the timeout, so it is clear we will keep retrying until the timeout expires. Or we could have error plus retry_until as a fixed deadline in UTC, which would be easier to put in the details; if we want to display the duration remaining, we can calculate it in Fleet or in the agent status output.
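
Sketching that alternative against the same example, with both field names and the deadline value being hypothetical:

upgrade_details:
    action_id: ace1e80d-5a9f-4156-ae40-84db7763c6bf
    metadata:
        error: 'fetching package failed: dial tcp 127.0.0.1:35295: connect: connection refused'
        retry_until: '2023-11-13T22:49:52Z'
    state: UPG_DOWNLOADING
    target_version: 8.12.0-SNAPSHOT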
