
[Agent] Elastic-Agent-Packaging-Linux is failed (there for e2e-testing pr support) which blocks merges - seems to be a Heartbeat container build problem #28570

Closed
EricDavisX opened this issue Oct 20, 2021 · 13 comments
Labels
blocker Team:Automation Label for the Observability productivity team

Comments

@EricDavisX
Contributor

EricDavisX commented Oct 20, 2021

This is from a dev via slack:
PR 'tests' are failing...
"The E2E agent linux tests are super flaky at the moment and preventing us from making imports. Is that something currently being looked at?"
citing this PR as example:
#28517

Manu helped bisect the pr failures and seemed to point towards the Heartbeat container not being available, so maybe a build problem there.

Victor noted that the e2e-testing support had been turned on (requiring that packaging) somewhat recently. PM me for slack convo link or team thread on the packaging change

image from slack:
image (5)

@EricDavisX EricDavisX added blocker Team:Elastic-Agent Label for the Agent team labels Oct 20, 2021
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@EricDavisX
Contributor Author

@andresrc @jlind23 @KseniaElastic @v1v what can we do to help resolve this quickly? I don't know the Beats packaging.

@EricDavisX EricDavisX changed the title [Agent] Linux-packaging (for e2e-testing pr support) is blocking PR merges, seems to be a Heartbeat container not-existing problem [Agent] Elastic-Agent-Packaging-Linux is failed (there for e2e-testing pr support) which blocks merges - seems to be a Heartbeat container build problem Oct 20, 2021
@jlind23
Collaborator

jlind23 commented Oct 20, 2021

@andrewvc seems that this is on Synthetics side.

@EricDavisX EricDavisX removed the Team:Elastic-Agent Label for the Agent team label Oct 20, 2021
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 20, 2021
@mdelapenya
Contributor

I'm currently bisecting this issue with the e2e-testing framework. For that, I'm basically passing the commit to the tests with this command:

For master (current released artifacts):

TAGS="heartbeat" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true make -C e2e/_suites/kubernetes-autodiscover functional-test

Starting the bisect with latest commit on heartbeat:

TAGS="heartbeat" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true GITHUB_CHECK_SHA1=af602c2b0df38bfc3fb5cfcfabcab1145b558022 ELASTIC_APM_ACTIVE=false make -C e2e/_suites/kubernetes-autodiscover functional-test

Will post results here.

@mdelapenya
Contributor

mdelapenya commented Oct 20, 2021

OK, taking this commit 99ebf3e as GOOD, I'm able to see the tests passing for that commit. Starting the bisect from that:

And the result is GOOD ✅

kind delete clusters --all && \
TAGS="heartbeat" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true \
GITHUB_CHECK_SHA1=99ebf3e4375c4dbee0ee281889e804b13a62a463 \
ELASTIC_APM_ACTIVE=false make -C e2e/_suites/kubernetes-autodiscover functional-test

List of commits, starting from GOOD:

@mdelapenya
Contributor

mdelapenya commented Oct 20, 2021

Trying with 298d786: Get metricbeat to compile on AIX

kind delete clusters --all && \
TAGS="heartbeat" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true \
GITHUB_CHECK_SHA1=298d786fc67301f429c0fe619fa06787093a6751 \
ELASTIC_APM_ACTIVE=false make -C e2e/_suites/kubernetes-autodiscover functional-test

And the result is BAD 🔴

@mdelapenya
Contributor

mdelapenya commented Oct 20, 2021

Trying with 53a618b: docs: link to new APM book

kind delete clusters --all && \
TAGS="heartbeat" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true \
GITHUB_CHECK_SHA1=53a618b36135db5c2940e1df48ffad164349b28c \
ELASTIC_APM_ACTIVE=false make -C e2e/_suites/kubernetes-autodiscover functional-test

And the result is BAD 🔴

@mdelapenya
Contributor

mdelapenya commented Oct 20, 2021

Trying with 0a24250: [pre-commit] for linting merge-conflict, pipelines and JJBB

kind delete clusters --all && \
TAGS="heartbeat" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true \
GITHUB_CHECK_SHA1=0a2425021bfb488cf21927443bbfafc0ec450bb7 \
ELASTIC_APM_ACTIVE=false make -C e2e/_suites/kubernetes-autodiscover functional-test

And the result is GOOD ✅

@mdelapenya
Contributor

The only commit that is left in this bisect is 81c38fc: [Heartbeat][Agent] Seccomp / synthetics bugfix improvements, which should fail the tests, being the culprit commit 🤞

kind delete clusters --all && \
TAGS="heartbeat" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true \
GITHUB_CHECK_SHA1=81c38fc4c009348d57c92ae85920aed35297a89e \
ELASTIC_APM_ACTIVE=false make -C e2e/_suites/kubernetes-autodiscover functional-test

And, indeed, the result is BAD 🔴, which means that we found the commit that introduced the issue.
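For future bisects, the manual steps above could be scripted. This is only a rough sketch, not something the e2e-testing framework provides: `nth` and `bisect_first_bad` are hypothetical helpers, and the commit list has to be passed oldest to newest, with the first entry known GOOD and the last known BAD. `e2e_passes` simply wraps the command used in the comments above.

```shell
# nth N ITEM...: print the N-th item (1-indexed) of the remaining arguments.
nth() { shift "$1"; printf '%s\n' "$1"; }

# bisect_first_bad TEST_CMD SHA...: binary search for the first BAD commit.
# Assumes the first SHA is known GOOD, the last is known BAD, and TEST_CMD
# exits 0 for a GOOD commit. Prints the first BAD SHA on stdout.
bisect_first_bad() {
  test_cmd="$1"; shift
  lo=1     # index of a commit known to be GOOD
  hi=$#    # index of a commit known to be BAD
  while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    sha=$(nth "$mid" "$@")
    # Send the test output to stderr so stdout only carries the result.
    if "$test_cmd" "$sha" >&2; then lo=$mid; else hi=$mid; fi
  done
  nth "$hi" "$@"
}

# Wrapper around the command used in this thread; $1 is the commit SHA.
e2e_passes() {
  kind delete clusters --all &&
  TAGS="heartbeat" TIMEOUT_FACTOR=3 LOG_LEVEL=TRACE DEVELOPER_MODE=true \
  GITHUB_CHECK_SHA1="$1" ELASTIC_APM_ACTIVE=false \
  make -C e2e/_suites/kubernetes-autodiscover functional-test
}
```

Usage would be `bisect_first_bad e2e_passes <good-sha> ... <bad-sha>`. Since these tests have been reported as flaky, a single BAD run may be worth repeating before trusting the result.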

@mdelapenya
Contributor

Let me explain what that test does: it uses the k8s-autodiscover test suite (@jsoriano and @ChrsMark can provide more context):

Scenario: Monitor pod availability using hints with named ports
  Given "heartbeat" is running with "hints enabled for pods"
   When "redis" is deployed with "monitor annotations with named port"
   Then "heartbeat" collects events with "kubernetes.pod.name:redis"
    And "heartbeat" collects events with "monitor.status:up"

The scenario always fails in the same step, the Then clause: Then "heartbeat" collects events with "kubernetes.pod.name:redis"

To understand the internals: the scenario creates a Kubernetes cluster with Kind and starts a Heartbeat pod at the version specified by the GITHUB_CHECK_SHA1 variable. If the variable is not set, the project uses released artifacts; otherwise it goes to our GCP bucket, looks up that commit, and downloads the artifacts needed by the test, in this case the TAR file containing Heartbeat in the Docker image format. It then loads the TAR file into the local Docker engine and imports that image into Kind, so that it's available inside the k8s cluster.
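For anyone who wants to reproduce the last two steps by hand, they roughly correspond to `docker load` followed by `kind load docker-image`. The sketch below is an approximation, not the framework's actual code: `load_image_into_kind` is a hypothetical helper, the TAR path and image name are whatever you downloaded, and the download from the bucket is left out because its location is not shown here.

```shell
# load_image_into_kind IMAGE_TAR IMAGE_NAME [CLUSTER]
# All three arguments are placeholders supplied by the caller.
# Set DRY_RUN=1 to print the commands instead of running them.
load_image_into_kind() {
  image_tar="$1"          # TAR file in Docker image format, already downloaded
  image_name="$2"         # name:tag of the image stored inside that TAR
  cluster="${3:-kind}"    # "kind" is kind's default cluster name
  run=${DRY_RUN:+echo}
  # Load the TAR into the local Docker engine, then copy the image
  # onto the nodes of the kind cluster.
  $run docker load -i "$image_tar" &&
  $run kind load docker-image "$image_name" --name "$cluster"
}
```

Running it with `DRY_RUN=1` first is a cheap way to check the arguments before touching the local Docker engine.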

As described in the Then clause, the heartbeat pod is not able to collect events after 81c38fc. The test logs are:

DEBU[2021-10-20T17:36:17+02:00] Failed to copy events from test-e18ad60b-5d71-48da-83cf-e91bb329c0fb/heartbeat-84b998df5d-tdx8x:/tmp/beats-events to /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-106333045/events: exit status 1
TRAC[2021-10-20T17:36:18+02:00] Validating required tools: [kubectl]
TRAC[2021-10-20T17:36:18+02:00] Binary is present binary=kubectl path=/usr/local/bin/kubectl
TRAC[2021-10-20T17:36:18+02:00] Executing command args="[--kubeconfig /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-190619484/kubeconfig --namespace test-e18ad60b-5d71-48da-83cf-e91bb329c0fb cp --no-preserve test-e18ad60b-5d71-48da-83cf-e91bb329c0fb/heartbeat-84b998df5d-tdx8x:/tmp/beats-events /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-106333045/events]" command=kubectl env="map[]"
ERRO[2021-10-20T17:36:18+02:00] Error executing command args="[--kubeconfig /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-190619484/kubeconfig --namespace test-e18ad60b-5d71-48da-83cf-e91bb329c0fb cp --no-preserve test-e18ad60b-5d71-48da-83cf-e91bb329c0fb/heartbeat-84b998df5d-tdx8x:/tmp/beats-events /var/folders/8h/pk8n63tn3px_tbs6_l862s_w0000gn/T/test-106333045/events]" baseDir=. command=kubectl env="map[]" error="exit status 1" stderr="error: unable to upgrade connection: container not found ("heartbeat")\n"

Where kubectl returns error: unable to upgrade connection: container not found ("heartbeat")

@mdelapenya
Contributor

I see that the only logs we are storing are the kind logs, not the cluster logs. Until the test framework is able to extract them (using kind's export logs command), I extracted the heartbeat pod's logs manually from my local execution:

2021-10-20T15:46:33.582605252Z stderr F Exiting: error loading config file: open /etc/heartbeat.yml: permission denied
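Once the logs are exported to a directory (for example with `kind export logs <dir>`), a fatal startup error like the one above can be found with a recursive grep. This is only a sketch: `find_fatal_beat_errors` is a hypothetical helper, and it assumes the Beat prefixes fatal errors with `Exiting: `, as in the line above.

```shell
# find_fatal_beat_errors LOG_DIR
# Print every line under LOG_DIR that contains "Exiting: ", prefixed with
# file name and line number. Exits non-zero when nothing matches.
find_fatal_beat_errors() {
  log_dir="$1"
  grep -rn "Exiting: " "$log_dir"
}
```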

@mdelapenya
Contributor

After debugging the issue with the team:

@andrewvc:

the easy way to test if that's the cause would be to set BEAT_SETUID_AS="" when running the container to override the default value in the container (which is heartbeat)

@mdelapenya:

I think we can declare that variable here https://github.com/elastic/e2e-testing/blob/master/e2e/_suites/kubernetes-autodiscover/testdata/templates/heartbeat.yml.tmpl#L73

After testing it locally with that variable added to the tests, for the same culprit commit:
@mdelapenya:

Overriding that variable does the trick and the tests pass again. I’ll send a PR to modify the descriptor, although I’d like @jsoriano and the Beats Platform Monitoring team to review the implications of that change (cc/ @andresrc )

@EricDavisX I'll postpone the resolution until the Beats team tells us that applying that variable at test time is safe
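For reference, the override itself can also be tried outside Kubernetes by passing an empty BEAT_SETUID_AS to `docker run`. This is only a sketch of the mechanism: `run_heartbeat_without_setuid` is a hypothetical helper, the image name is whatever was loaded locally, and a plain `docker run` will not necessarily reproduce the permission error seen in the pod.

```shell
# run_heartbeat_without_setuid IMAGE [ARGS...]
# IMAGE is a placeholder for a locally available Heartbeat image.
# Set DRY_RUN=1 to print the command instead of running it.
run_heartbeat_without_setuid() {
  image="$1"; shift
  run=${DRY_RUN:+echo}
  # "-e BEAT_SETUID_AS=" sets the variable to the empty string,
  # overriding the default value baked into the image.
  $run docker run --rm -e BEAT_SETUID_AS= "$image" "$@"
}
```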

@andresrc andresrc added the Team:Automation Label for the Observability productivity team label Oct 21, 2021
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 21, 2021
@EricDavisX
Contributor Author

resolved and no complaints so let's close it

5 participants