
add extended runtime test #4150

Merged: 59 commits into elastic:main, Mar 6, 2024

Conversation

@fearful-symmetry (Contributor) commented Jan 25, 2024

What does this PR do?

Closes #4206

This adds a test for running the agent for an extended period of time in order to check for stability issues stemming from memory leaks or other runtime problems.

Right now this is in draft mode for a few reasons:

  1. The primary motivation for this test is a bug stemming from leaving Windows OS handles around. However, the metric for reporting open handles initially appeared not to be supported on Windows, so we need to figure out if we want that fixed now or later. Update: it is supported, but it ends up in system.handles instead of beat.handles.

  2. The intention is to have the test fail if we exceed a watermark for memory usage/open handle count. However, we don't really know what that number should be. On 8.11.0 with the handle leak, we get about 800 open file handles after about 15 minutes; on a patched version, it's closer to 200 (see the sketch after this list).

  3. We still need to test this against the Windows release that prompted [metricbeat] memory leak 10GB RAM usage and crash, urgent solution needed beats#37142

  4. I would like to see if there are any other opinions about metrics that we should check/fail on
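
For illustration only, here is a minimal sketch of the watermark idea from item 2. The 500-handle threshold, the leakcheck package name, and the fetchOpenHandleCount callback are hypothetical placeholders, not values or helpers used by the actual test; the real number still needs to be decided.

    package leakcheck

    import (
        "testing"
        "time"

        "github.com/stretchr/testify/require"
    )

    // checkHandleWatermark polls an open-handle gauge and fails the test if it
    // ever exceeds a fixed ceiling. The threshold is a placeholder somewhere
    // between ~200 (healthy) and ~800 (leaking), per the numbers above.
    func checkHandleWatermark(t *testing.T, fetchOpenHandleCount func() (int, error)) {
        const maxOpenHandles = 500
        deadline := time.Now().Add(15 * time.Minute)
        for time.Now().Before(deadline) {
            count, err := fetchOpenHandleCount()
            require.NoError(t, err)
            require.LessOrEqualf(t, count, maxOpenHandles,
                "open handle count %d exceeded watermark %d", count, maxOpenHandles)
            time.Sleep(30 * time.Second)
        }
    }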

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

mergify bot (Contributor) commented Jan 25, 2024

This pull request does not have a backport label. Could you fix it @fearful-symmetry? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

@cmacknz (Member) commented Jan 25, 2024

> However, the metric for reporting open handles is not supported on Windows. Need to figure out if we want that fixed now or later.

Yes, let's do this first and then incorporate it here. It may be easier to detect the leaking handles than the increase in memory, because the handle count is strictly increasing while the memory usage is not, since the GC periodically reclaims some of it (though never back to the original starting point).

@fearful-symmetry fearful-symmetry marked this pull request as ready for review January 29, 2024 16:41
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner January 29, 2024 16:41
@elasticmachine (Contributor) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@jlind23 (Contributor) commented Jan 29, 2024

@fearful-symmetry will this run as part of the integration testing framework on a PR basis, or will it have to be manually triggered?
If it is part of the integration testing framework, will it run in parallel, or will it add X more minutes to the build time?

@pierrehilbert (Contributor) commented:

Even if running it on every PR could bring some value (as it would directly identify what is badly impacting our memory consumption), I think it is probably better to run it on a daily basis and avoid increasing our test duration.
WDYT?

@amitkanfer (Contributor) commented:

> Even if running it on every PR could bring some value (as it would directly identify what is badly impacting our memory consumption), I think it is probably better to run it on a daily basis and avoid increasing our test duration.
> WDYT?

Would be great to first understand by how much it'll extend the existing duration (and as Julien mentioned - best to run in parallel)

@fearful-symmetry (Contributor, Author) commented:

@amitkanfer We can adjust the runtime as needed. It kind of depends on what we're testing for. In my experience, stuff like the file handle leak we ran into in 8.11.0 will show up after 10-15 minutes, but more subtle memory/resource leaks could potentially take numerous hours.

If we want to specifically look for OS handle leaks as part of CI, I would suggest we investigate some kind of static/dynamic analysis tooling (or possibly invent our own, if there's nothing available for golang) that doesn't rely on running time to expose issues.

@amitkanfer (Contributor) commented:

> We can adjust the runtime as needed. It kind of depends on what we're testing for. In my experience, stuff like the file handle leak we ran into in 8.11.0 will show up after 10-15 minutes, but more subtle memory/resource leaks could potentially take numerous hours.

So if we run it in parallel, I think we can run it for a full hour without increasing the current test duration, correct?

@fearful-symmetry (Contributor, Author) commented:

@amitkanfer Yah, we could run it for an hour or so in parallel.

My concern is mostly that the 8.11.0 resource leak was a worst-case scenario in terms of behavior: it opened one handle per process on the system, so it would chew through resources pretty fast. More subtle behavior would definitely require longer tests. I would suggest we run one test on CI, then run a scheduled test with a 12-hour time limit, or something.

@amitkanfer (Contributor) commented:

> I would suggest we run one test on CI, then run a scheduled test with a 12-hour time limit, or something.

SGTM!

@cmacknz (Member) commented Jan 29, 2024

Our existing integration test runs take about an hour. We could easily fit a 30-60 minute test like this in on each PR and CI run and it won't be the bottleneck for test time. Or just run for a full hour every time.

We can also run a longer duration test on a schedule but we'll need to maintain two sets of pass/fail criteria.

@fearful-symmetry are you anticipating we only use thresholds for the pass/fail criteria? For something like the handle leak, is looking at the derivative of the handle value useful at all?

@fearful-symmetry (Contributor, Author) commented:

@cmacknz right now the idea is to take the max value. However, a max value depends on the runtime, which means that some kind of rate or derivative measurement might also be worth exploring...
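
A hedged sketch of what such a rate-based check might look like; the helper name, the leakcheck package, the sampling interval, and whatever slope threshold a caller picks are illustrative and not part of this PR:

    package leakcheck

    import "time"

    // handleGrowthPerMinute estimates how fast the open-handle gauge grows over
    // a run, using the first and last samples. A leak shows up as a sustained
    // positive slope regardless of total runtime, so the pass/fail threshold no
    // longer has to be tuned to the test duration the way a max value does.
    func handleGrowthPerMinute(samples []int, interval time.Duration) float64 {
        if len(samples) < 2 {
            return 0
        }
        elapsed := time.Duration(len(samples)-1) * interval
        growth := float64(samples[len(samples)-1] - samples[0])
        return growth / elapsed.Minutes()
    }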

@cmacknz (Member) commented Jan 30, 2024

> which means that some kind of rate or derivative measurement might also be worth exploring...

Is our number of handles metric a "number of handles ever opened" counter or a "number of handles currently open" gauge? The gauge version seems much more valuable, because I would not expect the total number of handles we have open to increase after a certain point unless we are leaking them.

@fearful-symmetry (Contributor, Author) commented:

@cmacknz it's a "handles currently open" gauge. As I noted in the original PR description, after about 15 minutes there's a pretty notable difference between the metric on patched and unpatched releases.

@fearful-symmetry (Contributor, Author) commented:

Alright, for what I think is the second or third time, filebeat has stopped ingesting test log files. I've spent all night fighting with it, with no progress. The harvester can see the files:

    "message": "Harvester started for paths: [/tmp/cef/cef*.log]",
    "log.logger": "input.harvester",
    "source_file": "/tmp/cef/cef2377177841.log",

But none of the actual logs show up.

@fearful-symmetry (Contributor, Author) commented:

Alright, I'm saying this is ready for review again. After fighting with it a bunch, I think I've addressed everything.

I had to remove spigot, since it doesn't compile on Windows, and I don't particularly want to block this PR for the sake of getting one dependency to work. Also, filebeat refused to read the files.

@cmacknz (Member) left a review:

A bunch of small things that are mostly code cleanup, otherwise LGTM.

Once the cleanup is addressed I'll approve this.

magefile.go (Outdated)
@@ -1950,6 +1950,14 @@ func (Integration) TestBeatServerless(ctx context.Context, beatname string) erro
return integRunner(ctx, false, "TestBeatsServerless")
}

func (Integration) TestExtendedRuntime(ctx context.Context) error {
Review comment (Member):

This wasn't actually changed.
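
For context, integration targets in magefile.go follow the integRunner pattern visible in the hunk above; a minimal sketch of what the new target plausibly looks like is below. The Go test name passed to integRunner is a guess for illustration, not necessarily what this PR registers:

    // Sketch only: mirrors the TestBeatServerless target shown above.
    // "TestLongRunningAgentForLeaks" is a placeholder test name.
    func (Integration) TestExtendedRuntime(ctx context.Context) error {
        return integRunner(ctx, false, "TestLongRunningAgentForLeaks")
    }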

installPackage.Name = fmt.Sprintf("%s-long-test-%s", name, policyUUID)
installPackage.Vars = map[string]interface{}{}

runner.T().Logf("Installing %s package....", name)
@pchila (Member) commented Mar 5, 2024:

This function doesn't need to be part of the fixture just to get runner.T(); we can pass t *testing.T as a parameter. We can also pass in the KibanaClient below.

@fearful-symmetry (Contributor, Author) replied:

It does use the runner to fetch the Kibana info. Figured that was cleaner than passing a few extra args.

ohc, ok := handle.handle.(types.OpenHandleCounter)
if ok {
    handleCount, err := ohc.OpenHandleCount()
    require.NoError(runner.T(), err)
Review comment (Member):

Do we fail the test and stop immediately if we fail to fetch one data point? If we use assert instead of require, we could get more insight into how often this happens and keep running until the end to see if there's a trend or other issues.
Same for the require on line 254.

@fearful-symmetry (Contributor, Author) replied Mar 5, 2024:

So, that call isn't really expected to fail under a normal workload; a failure most likely means that the process doesn't exist or is no longer running, which should probably be an immediate test failure.
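
For readers outside the thread: testify's require aborts the test at the first failed check, while assert records the failure and keeps going. A small sketch of the trade-off being discussed; the sampling loop, the leakcheck package, the fetchOpenHandleCount callback, and the interval are hypothetical:

    package leakcheck

    import (
        "testing"
        "time"

        "github.com/stretchr/testify/assert"
    )

    // sampleHandles keeps collecting data points even when one fetch fails:
    // assert.NoError marks the test as failed but returns false instead of
    // aborting, so a flaky metric read shows up as a pattern over the run
    // rather than as a single early exit.
    func sampleHandles(t *testing.T, fetchOpenHandleCount func() (int, error)) []int {
        var samples []int
        for i := 0; i < 10; i++ {
            count, err := fetchOpenHandleCount()
            if !assert.NoError(t, err) { // require.NoError here would stop the test immediately
                continue
            }
            samples = append(samples, count)
            time.Sleep(30 * time.Second)
        }
        return samples
    }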

Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

@fearful-symmetry fearful-symmetry requested a review from cmacknz March 5, 2024 18:40
@cmacknz (Member) left a review:

LGTM, thanks for sticking with all the comments. I like how small the implementation turned out to be considering how complex this test actually is.

@cmacknz (Member) commented Mar 5, 2024

@leehinman you still have changes requested; time for another look.

@leehinman (Contributor) left a review:

Sorry for not re-reviewing earlier. I'm good with the changes; I think we should merge and see how it "really" works in CI.

@fearful-symmetry fearful-symmetry merged commit b5fdc6d into elastic:main Mar 6, 2024
9 checks passed
Development

Successfully merging this pull request may close these issues.

Add a test that can detect OS handle leaks
9 participants