Implement Longevity test for Elastic Agent and Beats #3833
Comments
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
@cmacknz @leehinman does the elastic agent testing framework have the necessary infrastructure for this to happen?
Hmm, it can orchestrate the test setup (installing agent on a VM, creating a stack deployment, etc.) but we don't have a concept of a test that intentionally runs for hours or days. This is probably possible though. We could try writing dedicated soak integration tests triggered on a schedule that run for a full day and periodically check the metrics and health of the agent doing representative tasks. We can start with the system integration installed and expand from there.
The alert would just be the test failing, which is a nice way to get notified about this.
One thing that jumped out from elastic/beats#37142 that could allow us to write a more focused test for leaking handles on Windows is that there is a metric for open process handles, and it increased over time as this problem was happening (a screenshot in the linked issue shows this). Likely we can write a much shorter test to detect leaking file and process handles by starting the system integration, running through at least one data collection interval, and verifying that the number of handles is stable.
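For illustration, a rough sketch of what such a stability check could look like, assuming we look up the agent process by PID and call GetProcessHandleCount from kernel32 directly; the sampling strategy, tolerances, and helper names here are invented for this sketch and are not taken from the actual test framework:

```go
//go:build windows

package leakcheck

import (
	"fmt"
	"time"
	"unsafe"

	"golang.org/x/sys/windows"
)

var (
	kernel32              = windows.NewLazySystemDLL("kernel32.dll")
	getProcessHandleCount = kernel32.NewProc("GetProcessHandleCount")
)

// handleCount returns the number of open handles for the process with the given PID.
func handleCount(pid uint32) (uint32, error) {
	h, err := windows.OpenProcess(windows.PROCESS_QUERY_LIMITED_INFORMATION, false, pid)
	if err != nil {
		return 0, err
	}
	defer windows.CloseHandle(h)

	var count uint32
	r, _, callErr := getProcessHandleCount.Call(uintptr(h), uintptr(unsafe.Pointer(&count)))
	if r == 0 {
		return 0, callErr
	}
	return count, nil
}

// checkHandleStability samples the handle count across several collection
// intervals and reports an error if it only ever goes up.
func checkHandleStability(pid uint32, interval time.Duration, samples int) error {
	prev, err := handleCount(pid)
	if err != nil {
		return err
	}
	increases := 0
	for i := 0; i < samples; i++ {
		time.Sleep(interval)
		cur, err := handleCount(pid)
		if err != nil {
			return err
		}
		if cur > prev {
			increases++
		}
		prev = cur
	}
	if increases == samples {
		return fmt.Errorf("handle count increased on every sample, possible leak (last count %d)", prev)
	}
	return nil
}
```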
We can probably use pprof from integration tests (or another type of test) to get the metrics and produce comparable reports.
How feasible would it be to generate a pprof every time integration tests are run? I guess they will be generated on the provisioned VM and then have to be moved/uploaded to a central place, right?
Our diagnostics collect these at the end of every failing test already. For determining if there is a leak we don't need pprof; pprof will tell us where and why we have a leak. Instead we should be able to monitor the memory of the running agent processes. We can do this in a few ways:
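One way, as a rough sketch (the use of gopsutil and the loose process-name matching below are assumptions for illustration, not necessarily how the test framework would do it), is to sample the resident set size of the agent process tree from within the test:

```go
package leakcheck

import (
	"strings"

	"github.com/shirou/gopsutil/v3/process"
)

// agentRSSBytes sums the resident set size of every running process whose
// name looks like part of the agent (elastic-agent itself plus the Beats it
// spawns). The name matching here is deliberately loose for illustration.
func agentRSSBytes() (uint64, error) {
	procs, err := process.Processes()
	if err != nil {
		return 0, err
	}
	var total uint64
	for _, p := range procs {
		name, err := p.Name()
		if err != nil {
			continue // the process may have exited between listing and querying
		}
		if !strings.Contains(name, "elastic-agent") && !strings.Contains(name, "beat") {
			continue
		}
		mem, err := p.MemoryInfo()
		if err != nil || mem == nil {
			continue
		}
		total += mem.RSS
	}
	return total, nil
}
```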
We're going to upgrade to the 4th maintenance release to mitigate this risk, as we're going to be deploying Elastic Agent to our Windows production network, which supports core banking operations, over the next few months. You need to test for longer before a GA Elastic Agent release. If Elastic Agent causes an outage in business operations by crashing servers, I'll get sacked, 100%. We're already banned from using Elastic Defend because it maxed out CPU on 16-core production banking servers. We need a strict policy where the 4th maintenance release is fully tested and stable, and every minor release has a 4th maintenance release. Elastic also needs to add rollback and bug notification into Fleet. Fleet/Elastic Agent might be GA, but you're still building the platform, so there's a high risk of more severe bugs in the future.
Same here, we can't use Elastic Defend after we found that one of our production machines was overloaded.
Thanks for the feedback, we know we need to be better here.
This is tracked in elastic/kibana#172745. We know we need this, but we have to work through some of the backwards compatibility implications, since if we are going to enable it we want to make sure it works reliably. We are separately working to enable releasing fixes for Elastic Agent much faster. Ideally we want to be able to ship the fix for problems like elastic/beats#37142 within 24 hours of the fix being made. This is balanced against the need for adequate soak testing as noted.
I should specifically note we also know we shouldn't have released this in the first place. In addition to ensuring we can react to unexpected problems faster we are putting more tests, checks, and processes in place to avoid unexpected problems wherever we can.
So, I'm assuming we want to start with a "base" configuration (system metrics, logs) and expand from there. Considering the nature of the test, we can and probably should expand to a variety of configs/use cases/stories/whatever. For the base case, we still have a few questions:
For the purposes of this test we don't need to run for 24h. We need to run for long enough to detect that resource usage is increasing beyond what we expect it to be. I expect that most true leaks will be observable much faster than that; for example, over a 1h period memory usage should not be trending up continuously if the system is at steady state in terms of the work we have made the agent do.
The important thing to do is establish a baseline of resource usage and what variation it has. How long do we need to watch the agent for resource usage to stabilize? We also need to make sure we are doing representative work; the system integration needs to be installed and continuously ingesting logs and metrics, for example. Once we establish a baseline of resource usage we could then unconditionally fail the test if it exceeds it, however that ends up being defined; this will automatically result in us capturing heap profiles and diagnostics. A good exercise would be to create a branch which reverts the fix for elastic/beats#37142 or otherwise introduces an obvious resource leak and make sure that whatever conditions you put in place detect it. It could also be valuable to take a closer look at how quickly the open handles and memory usage increase. For problems related to metrics collection, the rate at which we leak handles/memory/whatever is likely to correlate with the metrics collection interval, so we can accelerate that to make the system fail faster. We can have the agent collect metrics every 1s or faster, for example.
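As one illustration of what "fail the test when usage keeps trending up" could look like, assuming the test has already gathered a slice of periodic samples for a resource (the least-squares fit and the tolerance handling are invented for this sketch):

```go
package leakcheck

// slopePerSample fits a least-squares line through equally spaced samples and
// returns its slope, i.e. the average growth per sample. A slope near zero
// means the resource is at steady state; a persistently positive slope over a
// long enough window suggests a leak.
func slopePerSample(samples []float64) float64 {
	n := float64(len(samples))
	if n < 2 {
		return 0
	}
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range samples {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	return (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
}

// looksLikeLeak flags a sample series whose fitted growth over the whole run
// exceeds tolerancePct percent of the starting value.
func looksLikeLeak(samples []float64, tolerancePct float64) bool {
	if len(samples) < 2 || samples[0] <= 0 {
		return false
	}
	growth := slopePerSample(samples) * float64(len(samples)-1)
	return growth > samples[0]*tolerancePct/100
}
```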
Continue to use the existing build failure notifications. Cleaning those up is happening as a separate effort.
Elastic Defend on pre-5.10.16 versions of the Linux kernel falls back to tracefs (kernel tracing) instead of using eBPF. We found this by running Elastic Defend on 16-core production Linux servers which were built 2-3 years ago (CPU ran between 60-98%). They get patched but the kernel doesn't get updated. It was either that or monitoring auditd system calls. https://github.com/elastic/ebpf Elastic Defend would be good to include in longevity tests, using Windows/Linux servers running under load to replicate production (both new and old Linux kernels).
The test suite we use for Elastic Agent here can and does run Elastic Defend. There is also an entirely separate team dedicated purely to Defend covering Defend-specific problems; we primarily focus on ensuring we can install/upgrade/monitor Defend through Elastic Agent. We also have dedicated performance testing infrastructure using our BPF-based whole system profiler, but BPF doesn't help on Windows. There is some overlap here with what that infrastructure does, which is more focused on throughput testing than cross-platform functional testing.
So, it sounds to me like we want to roll this out in phases:
Yes. We will want this enabled by default for every test we have eventually. Ideally we can find a way to avoid just hard coding the pass/fail thresholds, and come up with something like "total memory usage at the end of the test needs to be within X% of the starting memory usage". We should have a threshold as a sanity check as well; we don't want to be using 1 GB and then 1.1 GB of memory doing something simple and not notice it, but if we depend on having a finely tuned threshold that is going to end up being too flaky. I would suggest starting with the open handles metric in #3833 (comment) and looking at memory afterwards. There is no reason that should be increasing continuously: files we are reading will stay open, but they shouldn't be reopened continuously. I hope that is an easy one to add and it would immediately prevent handle leaks on Windows; you should be able to add this and prove it would have detected the previous snapshot handle leak.
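A minimal sketch of the "end-of-test value within X% of the starting value" style of sanity check described above; the function name and error wording are invented for illustration:

```go
package leakcheck

import "fmt"

// withinPercent reports an error when the end-of-test value has grown more
// than maxGrowthPct percent above the value captured at the start of the
// test. The threshold itself still has to be chosen per resource.
func withinPercent(start, end uint64, maxGrowthPct float64) error {
	if start == 0 {
		return fmt.Errorf("invalid starting value 0")
	}
	growthPct := (float64(end) - float64(start)) / float64(start) * 100
	if growthPct > maxGrowthPct {
		return fmt.Errorf("resource usage grew %.1f%% (start=%d end=%d), above the allowed %.1f%%",
			growthPct, start, end, maxGrowthPct)
	}
	return nil
}
```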
What do you mean by "every test we have"?
Ideally every integration test we have should have a check at the end of it to ensure it didn't leak memory, handles, or any other resource we can feasibly detect.
So, as far as I can tell, there's nothing in Metricbeat for Windows that reports the count of open file handles.
Spoke with @fearful-symmetry in our team meeting today: we need to make sure that the test forces the agent to do real work at an accelerated rate. This should help us identify leaks much faster. My ideal test here is primarily a stress test that after 15-30 minutes can detect increasing resource usage. This will make it possible to run this test directly in our PRs instead of separately on a schedule. We can do this by:
I've added these steps to the issue description.
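To illustrate the kind of accelerated work intended here, a hypothetical standalone policy fragment (embedded as a test fixture string) that makes the system integration collect metrics every second; the keys below sketch the standalone policy format and are not copied from the actual test or issue description:

```go
package leakcheck

// acceleratedSystemMetricsPolicy is an illustrative standalone policy fragment
// that makes the system integration collect metrics every second instead of at
// its default interval, so any per-collection leak shows up much sooner.
const acceleratedSystemMetricsPolicy = `
inputs:
  - type: system/metrics
    id: accelerated-system-metrics
    use_output: default
    data_stream:
      namespace: default
    streams:
      - metricset: cpu
        data_stream:
          dataset: system.cpu
        period: 1s
      - metricset: memory
        data_stream:
          dataset: system.memory
        period: 1s
`
```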
@cmacknz so, I'm a little worried about the first item on that list, and I wonder if perhaps it should be broken out to a separate unit test or something? Wouldn't any relevant policy updates force restarts of components or underlying modules in beats? If so, that might interrupt any kind of in-process resource leak.
The agent should not restart components that are unmodified. It would only be starting and stopping the input we are adding and removing. There are some quirks to this, like if you start changing output parameters the Beats will restart, which is why I suggested only adding and removing a single input. I am fine with breaking this up and treating this as a parent level meta issue. What I don't want is for this issue to be closed if we have only solved the problem for a narrow set of possible resource leaks.
Yah, the first item is different enough that I feel like it should probably be spun out into a different issue, since so far this has been focused on the behavior of the integrations and resource leaks stemming from OS/external APIs. The PR has been updated to reduce the collection interval.
I had a thought on how we could "test the test" here as part of our exercise to prove this can actually detect resource leaks: can we implement an input that can be configured to leak memory and OS-level resources in various ways? We could constantly leak goroutines at a configurable rate, leak Windows handles or file handles, open an increasing number of sockets, etc. This would give us a way to evaluate the effectiveness and stability of the test. We are going to need to spend considerable time convincing ourselves the false positive rate is acceptable, and also that we can still detect true positives.
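A rough sketch of what the goroutine-leaking flavour of such an input could look like; the package name and configuration shape are invented for illustration, and similar knobs could cover file handles, Windows handles, or sockets:

```go
package leakyinput

import (
	"context"
	"time"
)

// LeakConfig controls how aggressively the fake input leaks.
type LeakConfig struct {
	GoroutinesPerInterval int           // goroutines to leak on every tick
	Interval              time.Duration // how often to leak them
}

// Run leaks goroutines at the configured rate until ctx is cancelled. Each
// leaked goroutine blocks forever on a channel nobody ever writes to.
func Run(ctx context.Context, cfg LeakConfig) {
	ticker := time.NewTicker(cfg.Interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for i := 0; i < cfg.GoroutinesPerInterval; i++ {
				go func() {
					block := make(chan struct{})
					<-block // never closed: this goroutine leaks intentionally
				}()
			}
		}
	}
}
```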
I've split the implementation here into pieces, from easiest (in the sense that it has already been implemented in a PR) to more challenging. The goal is to get checks we think we can stabilize onto main sooner.
Something else to follow up on: we can likely do targeted versions of these tests in individual unit tests. For goroutines there is https://github.com/uber-go/goleak that could be used off the shelf, but I bet we could write a similar detection package for OS handle leaks that can be used directly in unit tests.
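For reference, a minimal example of wiring goleak into a unit test; a handle-leak detector could follow the same defer-at-the-top pattern:

```go
package mypackage_test

import (
	"testing"

	"go.uber.org/goleak"
)

func TestNoGoroutineLeak(t *testing.T) {
	// Fails the test if any unexpected goroutines are still running when it ends.
	defer goleak.VerifyNone(t)

	// Exercise the code under test here; any goroutine it starts must have
	// exited by the time the test returns.
	done := make(chan struct{})
	go func() { close(done) }()
	<-done
}
```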
@cmacknz yeah, I've been tinkering with ideas for some kind of handle leak detector that can be run at the unit-test level. I would argue it's sort of necessary, since there's no way we could test everything in the existing integration test. I'll bring it up at the next team meeting.
After discussing with @fearful-symmetry, #4208 isn't really part of this meta issue's scope, so I will close this one.
Npcap in the network packet capture integration has been reported as having a memory leak. This is running on many critical servers, including domain controllers.
The team recently fixed a bug in the elastic-agent-system-metrics code base where we were continuously increasing our memory consumption on Windows until BSOD. Here is the issue raised by the community: elastic/beats#37142. Here is the fix made by @leehinman: elastic/elastic-agent-system-metrics#115.
One way to avoid this happening again would be to have a cloud stack with Windows, Mac & Linux agents running in the cloud for several days after each BC, with data continuously ingested from the system. Based on this data we should set up alerts for memory usage.
This test must make sure that the agent is doing representative work. To minimize the expected time to identify resource leaks, we should accelerate the natural rate of data ingestion and change in the system to stress the agent and accelerate any leaks that may be present.
We can do this by:
- Configuring the agent over the control protocol RPC (elastic-agent/control_v2.proto, lines 361 to 367 at d8b446f).
This will exclude the code paths that interact with Fleet, but we can address that as a follow up. Likely the policy reconfiguration can be driven with our fake fleet server as a future enhancement, with the agent configured to check in rapidly.
We can introduce the leak detection checks we need in phases, tracked in:
- elastic-agent status should forward info from running processes #4210