
add extended runtime test #4150

Merged: 59 commits into elastic:main, Mar 6, 2024

Conversation

@fearful-symmetry (Contributor) commented Jan 25, 2024

What does this PR do?

Closes #4206

This adds a test for running the agent for an extended period of time in order to check for stability issues stemming from memory leaks or other runtime problems.

Right now this is in draft mode for a few reasons:

  1. The primary motivation for this test is a bug stemming from leaving Windows OS handles around. However, the metric for reporting open handles initially appeared not to be supported on Windows, so we need to figure out if we want that fixed now or later. Update: it is supported, but it ends up in system.handles instead of beat.handles.

  2. The intention is to have the test fail if we exceed a watermark for memory usage/open handle count. However, we don't really know what that number should be. On 8.11.0 with the handle leak, we get about 800 open file handles after about 15 minutes; on a patched version, it's closer to 200 (see the sketch after this list).

  3. We still need to test this against the Windows release that prompted [metricbeat] memory leak 10GB RAM usage and crash, urgent solution needed beats#37142

  4. I would like to see if there are any other opinions about metrics that we should check/fail on
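
For illustration only, here is a minimal sketch of the watermark idea from item 2. The 500-handle threshold, the leakcheck package name, and the fetchOpenHandleCount callback are hypothetical placeholders, not values or helpers used by the actual test; the real number still needs to be decided.

    package leakcheck

    import (
        "testing"
        "time"

        "github.com/stretchr/testify/require"
    )

    // checkHandleWatermark polls an open-handle gauge and fails the test if it
    // ever exceeds a fixed ceiling. The threshold is a placeholder somewhere
    // between ~200 (healthy) and ~800 (leaking), per the numbers above.
    func checkHandleWatermark(t *testing.T, fetchOpenHandleCount func() (int, error)) {
        const maxOpenHandles = 500
        deadline := time.Now().Add(15 * time.Minute)
        for time.Now().Before(deadline) {
            count, err := fetchOpenHandleCount()
            require.NoError(t, err)
            require.LessOrEqualf(t, count, maxOpenHandles,
                "open handle count %d exceeded watermark %d", count, maxOpenHandles)
            time.Sleep(30 * time.Second)
        }
    }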

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

mergify bot (Contributor) commented Jan 25, 2024

This pull request does not have a backport label. Could you fix it @fearful-symmetry? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

@cmacknz (Member) commented Jan 25, 2024

> However, the metric for reporting open handles is not supported on Windows. Need to figure out if we want that fixed now or later.

Yes, let's do this first and then incorporate it here. It may be easier to detect the leaking handles than the increase in memory, because the handle count is strictly increasing while the memory usage is not, since the GC periodically reclaims some of it (though never back to the original starting point).

@fearful-symmetry fearful-symmetry marked this pull request as ready for review January 29, 2024 16:41
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner January 29, 2024 16:41
@elasticmachine (Contributor) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@jlind23 (Contributor) commented Jan 29, 2024

@fearful-symmetry will this run as part of the integration testing framework on a PR basis, or will it have to be manually triggered?
If it is part of the integration testing framework, will it run in parallel, or will it add X more minutes to the build time?

@pierrehilbert (Contributor) commented:

Even if running it on every PR could bring some value (as it would directly identify what is badly impacting our memory consumption), I think it is probably better to run it on a daily basis and avoid increasing our test duration.
WDYT?

@amitkanfer (Contributor) commented:

> Even if running it on every PR could bring some value (as it would directly identify what is badly impacting our memory consumption), I think it is probably better to run it on a daily basis and avoid increasing our test duration.
> WDYT?

Would be great to first understand by how much it'll extend the existing duration (and as Julien mentioned - best to run in parallel)

@fearful-symmetry (Contributor, Author) commented:

@amitkanfer We can adjust the runtime as needed. It kind of depends on what we're testing for. In my experience, stuff like the file handle leak we ran into in 8.11.0 will show up after 10-15 minutes, but more subtle memory/resource leaks could potentially take numerous hours.

If we want to specifically look for OS handle leaks as part of CI, I would suggest we investigate some kind of static/dynamic analysis tooling (or possibly invent our own, if there's nothing available for golang) that doesn't rely on running time to expose issues.

@amitkanfer (Contributor) commented:

> We can adjust the runtime as needed. It kind of depends on what we're testing for. In my experience, stuff like the file handle leak we ran into in 8.11.0 will show up after 10-15 minutes, but more subtle memory/resource leaks could potentially take numerous hours.

So if we run it in parallel, I think we can run it for a full hour without increasing the current test duration, correct?

@fearful-symmetry (Contributor, Author) commented:

@amitkanfer Yah, we could run it for an hour or so in parallel.

My concern is mostly that the 8.11.0 resource leak was a worst-case scenario in terms of behavior: it opened one handle per process on the system, so it would chew through resources pretty fast. More subtle behavior would definitely require longer tests. I would suggest we run one test on CI, then run a scheduled test with a 12-hour time limit, or something.

@amitkanfer (Contributor) commented:

> I would suggest we run one test on CI, then run a scheduled test with a 12-hour time limit, or something.

SGTM!

@cmacknz (Member) commented Jan 29, 2024

Our existing integration test runs take about an hour. We could easily fit a 30-60 minute test like this in on each PR and CI run and it won't be the bottleneck for test time. Or just run for a full hour every time.

We can also run a longer duration test on a schedule but we'll need to maintain two sets of pass/fail criteria.

@fearful-symmetry are you anticipating we only use thresholds for the pass/fail criteria? For something like the handle leak, is looking at the derivative of the handle value useful at all?

@fearful-symmetry (Contributor, Author) commented:

@cmacknz right now the idea is to take the max value. However, a max value depends on the runtime, which means that some kind of rate or derivative measurement might also be worth exploring...
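
A hedged sketch of what such a rate-based check might look like; the helper name, the leakcheck package, the sampling interval, and whatever slope threshold a caller picks are illustrative and not part of this PR:

    package leakcheck

    import "time"

    // handleGrowthPerMinute estimates how fast the open-handle gauge grows over
    // a run, using the first and last samples. A leak shows up as a sustained
    // positive slope regardless of total runtime, so the pass/fail threshold no
    // longer has to be tuned to the test duration the way a max value does.
    func handleGrowthPerMinute(samples []int, interval time.Duration) float64 {
        if len(samples) < 2 {
            return 0
        }
        elapsed := time.Duration(len(samples)-1) * interval
        growth := float64(samples[len(samples)-1] - samples[0])
        return growth / elapsed.Minutes()
    }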

@cmacknz (Member) commented Jan 30, 2024

> which means that some kind of rate or derivative measurement might also be worth exploring...

Is our number of handles metric a "number of handles ever opened" counter or a "number of handles currently open" gauge? The gauge version seems much more valuable, because I would not expect the total number of handles we have open to increase after a certain point unless we are leaking them.

@fearful-symmetry (Contributor, Author) commented:

@cmacknz it's a "handles currently open" gauge. As I noted in the original PR description, after about 15 minutes there's a pretty notable difference between the metric on patched and unpatched releases.

@fearful-symmetry (Contributor, Author) commented:

Alright, for what I think is the second or third time, filebeat has stopped ingesting test log files. I've spent all night fighting with it, with no progress. The harvester can see the files:

    "message": "Harvester started for paths: [/tmp/cef/cef*.log]",
    "log.logger": "input.harvester",
    "source_file": "/tmp/cef/cef2377177841.log",

But none of the actual logs show up.

@fearful-symmetry (Contributor, Author) commented:

Alright, I'm saying this is ready for review again. After fighting with it a bunch, I think I've addressed everything.

I had to remove spigot, since it doesn't compile on Windows, and I don't particularly want to block this PR for the sake of getting one dependency to work. Also, filebeat refused to read the files.

@cmacknz (Member) left a review:

A bunch of small things that are mostly code cleanup, otherwise LGTM.

Once the cleanup is addressed I'll approve this.

magefile.go (Outdated)
@@ -1950,6 +1950,14 @@ func (Integration) TestBeatServerless(ctx context.Context, beatname string) erro
return integRunner(ctx, false, "TestBeatsServerless")
}

func (Integration) TestExtendedRuntime(ctx context.Context) error {
Review comment (Member):

This wasn't actually changed.
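
For context, integration targets in magefile.go follow the integRunner pattern visible in the hunk above; a minimal sketch of what the new target plausibly looks like is below. The Go test name passed to integRunner is a guess for illustration, not necessarily what this PR registers:

    // Sketch only: mirrors the TestBeatServerless target shown above.
    // "TestLongRunningAgentForLeaks" is a placeholder test name.
    func (Integration) TestExtendedRuntime(ctx context.Context) error {
        return integRunner(ctx, false, "TestLongRunningAgentForLeaks")
    }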

installPackage.Name = fmt.Sprintf("%s-long-test-%s", name, policyUUID)
installPackage.Vars = map[string]interface{}{}

runner.T().Logf("Installing %s package....", name)
@pchila (Member) commented Mar 5, 2024:

This function doesn't need to be part of the fixture just to get runner.T(); we can pass t *testing.T as a parameter. We can also pass in the KibanaClient below.

@fearful-symmetry (Contributor, Author) replied:

It does use the runner to fetch the Kibana info. Figured that was cleaner than passing a few extra args.

ohc, ok := handle.handle.(types.OpenHandleCounter)
if ok {
    handleCount, err := ohc.OpenHandleCount()
    require.NoError(runner.T(), err)
Review comment (Member):

Do we fail the test and stop immediately if we fail to fetch one data point? If we use assert instead of require, we could get more insight into how often this happens and keep running until the end to see if there's a trend or other issues.
Same for the require on line 254.

@fearful-symmetry (Contributor, Author) replied Mar 5, 2024:

So, that call isn't really expected to fail under a normal workload; a failure most likely means that the process doesn't exist or is no longer running, which should probably be an immediate test failure.
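
For readers outside the thread: testify's require aborts the test at the first failed check, while assert records the failure and keeps going. A small sketch of the trade-off being discussed; the sampling loop, the leakcheck package, the fetchOpenHandleCount callback, and the interval are hypothetical:

    package leakcheck

    import (
        "testing"
        "time"

        "github.com/stretchr/testify/assert"
    )

    // sampleHandles keeps collecting data points even when one fetch fails:
    // assert.NoError marks the test as failed but returns false instead of
    // aborting, so a flaky metric read shows up as a pattern over the run
    // rather than as a single early exit.
    func sampleHandles(t *testing.T, fetchOpenHandleCount func() (int, error)) []int {
        var samples []int
        for i := 0; i < 10; i++ {
            count, err := fetchOpenHandleCount()
            if !assert.NoError(t, err) { // require.NoError here would stop the test immediately
                continue
            }
            samples = append(samples, count)
            time.Sleep(30 * time.Second)
        }
        return samples
    }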

Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

@fearful-symmetry fearful-symmetry requested a review from cmacknz March 5, 2024 18:40
@cmacknz (Member) left a review:

LGTM, thanks for sticking with all the comments. I like how small the implementation turned out to be considering how complex this test actually is.

@cmacknz (Member) commented Mar 5, 2024

@leehinman you still have changes requested; time for another look.

@leehinman (Contributor) left a review:

Sorry for not re-reviewing earlier. I'm good with the changes; I think we should merge and see how it "really" works in CI.

@fearful-symmetry fearful-symmetry merged commit b5fdc6d into elastic:main Mar 6, 2024
9 checks passed
Development

Successfully merging this pull request may close these issues.

Add a test that can detect OS handle leaks
9 participants