add extended runtime test #4150
Conversation
This pull request does not have a backport label. Could you fix it @fearful-symmetry? 🙏
Yes, let's do this first and then incorporate it here. It may be easier to detect the leaking handles than the increase in memory, because the handle count is strictly increasing while the memory usage is not: the GC periodically reclaims some of it (though never back down to the original starting point).
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
@fearful-symmetry will this run as part of the integration testing framework on a PR basis, or will it have to be manually triggered?
Even if running it on every PR can bring some value (as it will identify directly what is impacting our memory consumption badly), I think it is probably better to run it on a daily basis and avoid increasing our test duration.
Would be great to first understand by how much it'll extend the existing duration (and, as Julien mentioned, best to run in parallel).
@amitkanfer We can adjust the runtime as needed. It kind of depends on what we're testing for. In my experience, something like the file handle leak we ran into in 8.11.0 will show up after 10-15 minutes, but more subtle memory/resource leaks could take many hours. If we want to specifically look for OS handle leaks as part of CI, I would suggest we investigate some kind of static/dynamic analysis tooling (or possibly invent our own, if there's nothing available for Go) that doesn't rely on running time to expose issues.
So if we run it in parallel, I think we can run it for a full hour without increasing the current test duration. Correct?
@amitkanfer Yeah, we could run it for an hour or so in parallel. My concern is mostly that the 8.11.0 resource leak was a worst-case scenario in terms of behavior: it opened one handle per process on the system, so it would chew through resources pretty fast. More subtle behavior would definitely require longer tests. I would suggest we run one test on CI, then run a scheduled test with a 12-hour time limit, or something along those lines.
SGTM!
Our existing integration test runs take about an hour. We could easily fit a 30-60 minute test like this into each PR and CI run without it becoming the bottleneck for test time, or just run it for a full hour every time. We can also run a longer-duration test on a schedule, but then we'll need to maintain two sets of pass/fail criteria. @fearful-symmetry are you anticipating we only use thresholds for the pass/fail criteria? For something like the handle leak, is looking at the derivative of the handle value useful at all?
@cmacknz right now the idea is to take the max value. However, a max value depends on the runtime, which means that some kind of rate or derivative measurement might also be worth exploring.
Is our number-of-handles metric a "number of handles ever opened" counter or a "number of handles currently open" gauge? The gauge version seems much more valuable, because I would not expect the total number of handles we have open to increase after a certain point unless we are leaking them.
@cmacknz it's a "current handles open" gauge. As I noted in the original PR description, after about 15 minutes there's a pretty notable difference between the metric on patched and non-patched releases.
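The rate-based check floated above could be sketched roughly as follows. This is a hypothetical helper, not code from the PR: it fits a least-squares slope to handle-count samples taken at a fixed interval, so a healthy gauge hovers near a slope of zero while a leak shows a sustained positive slope regardless of total runtime.

```go
package main

import "fmt"

// leakSlope fits a least-squares line to handle-count samples taken at a
// fixed interval and returns the slope in handles per sample. A value near
// zero means the gauge is flat; a sustained positive value suggests a leak.
func leakSlope(samples []float64) float64 {
	n := float64(len(samples))
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range samples {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	return (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
}

func main() {
	// Illustrative numbers only: a flat, healthy series vs. a rising one.
	steady := []float64{200, 201, 199, 202, 200, 201}
	leaking := []float64{200, 300, 410, 500, 610, 700}
	fmt.Printf("steady slope: %.2f\n", leakSlope(steady))
	fmt.Printf("leaking slope: %.2f\n", leakSlope(leaking))
}
```

The advantage over a max-value watermark is that the slope is comparable across different test durations, which is exactly the concern raised about the max depending on runtime.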
Alright, for what I think is the second or third time, filebeat has stopped ingesting test log files. I've spent all night fighting with it, with no progress. The harvester can see the files, but none of the actual log contents come through.
Alright, saying this is ready for review again. After fighting with it a bunch, I think I've addressed everything. I had to remove spigot, since it doesn't compile on Windows, and I don't particularly want to block this PR for the sake of getting one dependency to work. Also, filebeat refused to read the files.
A bunch of small things that are mostly code cleanup, otherwise LGTM.
Once the cleanup is addressed I'll approve this.
magefile.go
@@ -1950,6 +1950,14 @@ func (Integration) TestBeatServerless(ctx context.Context, beatname string) error
	return integRunner(ctx, false, "TestBeatsServerless")
}

func (Integration) TestExtendedRuntime(ctx context.Context) error {
This wasn't actually changed.
installPackage.Name = fmt.Sprintf("%s-long-test-%s", name, policyUUID)
installPackage.Vars = map[string]interface{}{}

runner.T().Logf("Installing %s package....", name)
This function doesn't need to be part of the fixture just to get runner.T(); we can pass t *testing.T as a parameter. We can also pass in the KibanaClient below.
It does use the runners to fetch the kibana info. Figured that was cleaner than passing a few extra args.
ohc, ok := handle.handle.(types.OpenHandleCounter)
if ok {
	handleCount, err := ohc.OpenHandleCount()
	require.NoError(runner.T(), err)
Do we fail the test and stop immediately if we fail to fetch one data point? If we use assert instead of require, we could get more insight into how often this happens and keep running until the end, to see if there's a trend or other issues. Same for the require on line 254.
So, that call isn't really expected to fail under a normal workload; a failure most likely means that the process doesn't exist or is no longer running, which probably warrants an immediate failure.
LGTM, thanks for sticking with all the comments. I like how small the implementation turned out to be considering how complex this test actually is.
@leehinman you still have changes requested, time for another look.
Sorry for not re-reviewing earlier. I'm good with the changes; I think we should merge and see how it "really" works in CI.
What does this PR do?
Closes #4206
This adds a test for running the agent for an extended period of time in order to check for stability issues stemming from memory leaks or other runtime problems.
Right now this is in draft mode for a few reasons:
- The primary motivation for this test is a bug stemming from leaving Windows OS handles around. However, the metric for reporting open handles is not supported on Windows. Need to figure out if we want that fixed now or later. (Update: it's supported, but it ends up in system.handles instead of beat.handles.)
- The intention is to have the test fail if we exceed a watermark for memory usage/open handle count. However, we don't really know what that number should be. On 8.11.0 with the handle leak, we get about 800 open file handles after about 15 minutes; on a patched version, it's closer to 200.
- We still need to test this against the Windows release that prompted [metricbeat] memory leak 10GB RAM usage and crash, urgent solution needed beats#37142.
- I would like to see if there are any other opinions about metrics that we should check/fail on.
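The watermark approach from the description could be sketched like this. It's a hypothetical helper, not code from the PR, and the 500-handle limit is an illustrative midpoint between the ~200 handles observed on a patched build and ~800 on the leaking one, not a number the PR settled on:

```go
package main

import "fmt"

// exceedsWatermark reports the peak sampled open-handle count and whether it
// crossed the pass/fail limit. The limit here is illustrative; picking the
// right value is exactly the open question in the PR description.
func exceedsWatermark(samples []int, limit int) (peak int, exceeded bool) {
	for _, s := range samples {
		if s > peak {
			peak = s
		}
	}
	return peak, peak > limit
}

func main() {
	// Illustrative series based on the numbers quoted in the description.
	patched := []int{180, 195, 210, 205}
	leaking := []int{250, 480, 790, 810}
	p, bad := exceedsWatermark(patched, 500)
	fmt.Println(p, bad)
	p, bad = exceedsWatermark(leaking, 500)
	fmt.Println(p, bad)
}
```

As noted in the review thread, a fixed watermark like this is sensitive to test duration, which is why a rate-based check may be worth exploring alongside it.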
Checklist
- Changelog entry added under ./changelog/fragments using the changelog tool