
[Elastic Agent] Investigate running all beats through the system process manager #18362

Closed
blakerouse opened this issue May 7, 2020 · 9 comments

@blakerouse
Contributor

Currently Elastic Agent runs filebeat and metricbeat as subprocesses, but with the addition of endpoint, Agent needs to manage that application through the system process manager (systemd, services.msc, launchctl).

Instead of making endpoint a one-off case, I propose we look at running them all through the system process manager. There are benefits to doing this:

  • System managers automatically handle restarting and starting on boot
  • If Agent dies, filebeat, metricbeat, endpoint, and more in the future will keep on running, so no events will be lost

In the case of the Elastic Agent container, the first process can be systemd, which spawns Elastic Agent; all the beats would then be managed by that systemd init process.
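As a rough sketch of what "managed by the system process manager" could look like on Linux, Agent could generate a unit along these lines per application (paths and names here are illustrative, not Agent's actual layout):

```ini
[Unit]
Description=Elastic Agent managed filebeat
After=network.target

[Service]
ExecStart=/usr/share/filebeat/filebeat -e
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`Restart=always` and `WantedBy=multi-user.target` are what provide the two benefits above: automatic restarts, and starting on boot.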

@ph ph added the Ingest Management:beta1 Group issues for ingest management beta1 label May 8, 2020
@ph
Contributor

ph commented May 8, 2020

Just added the beta1 label; if we go this route I believe it's the right timing.

@blakerouse
Contributor Author

Yes, I have been looking into this.

It would be a little different on each OS, so we can create an interface with an implementation for each OS build.

Something like below would work.

// Service describes an application to run under the host's service manager.
type Service struct {
    Name    string   // service name, e.g. "filebeat"
    Command string   // path to the executable
    Args    []string // command-line arguments
}

// ServiceManager abstracts the platform-specific service manager.
type ServiceManager interface {
    CreateService(*Service) error
    StartService(*Service) error
    StopService(*Service) error
    DestroyService(*Service) error
}

// One implementation per platform.
type SystemDServiceManager struct{} // Linux (systemd)
type LaunchdServiceManager struct{} // macOS (launchd)
type WindowsServiceManager struct{} // Windows (services.msc)
type DockerServiceManager struct{}  // containers (subprocess-based)

In the DockerServiceManager case I think it's best if we still use subprocesses to run the beats, but that can be encapsulated in the manager. The reason is that systemd inside a container does not work well, because it also assigns cgroups to each process, which is a problem inside a container. It is possible by mounting /sys/fs/cgroup and running the container privileged, but I don't think we want that requirement.
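A SystemDServiceManager along these lines could simply shell out to systemctl. This is a minimal sketch of the idea, not Agent's actual implementation; the `run` hook is injectable purely so the logic can be exercised without a real systemd (CreateService/DestroyService would similarly write and remove the unit file):

```go
package main

import (
	"fmt"
	"os/exec"
)

// Service matches the struct from the interface sketch above.
type Service struct {
	Name    string
	Command string
	Args    []string
}

// SystemDServiceManager drives systemd through systemctl. The run field
// is injectable so the logic can be tested without systemd present.
type SystemDServiceManager struct {
	run func(name string, args ...string) error
}

// NewSystemDServiceManager returns a manager that really invokes systemctl.
func NewSystemDServiceManager() *SystemDServiceManager {
	return &SystemDServiceManager{run: func(name string, args ...string) error {
		return exec.Command(name, args...).Run()
	}}
}

// StartService starts the unit named after the service.
func (m *SystemDServiceManager) StartService(s *Service) error {
	return m.run("systemctl", "start", s.Name+".service")
}

// StopService stops the unit named after the service.
func (m *SystemDServiceManager) StopService(s *Service) error {
	return m.run("systemctl", "stop", s.Name+".service")
}

func main() {
	// Record invocations instead of touching the host's systemd.
	var calls [][]string
	m := &SystemDServiceManager{run: func(name string, args ...string) error {
		calls = append(calls, append([]string{name}, args...))
		return nil
	}}
	_ = m.StartService(&Service{Name: "filebeat"})
	fmt.Println(calls[0]) // [systemctl start filebeat.service]
}
```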

Another option could be to have an agent container talk through OCI, docker, or Kubernetes to spawn the containers on the actual platform. So it could be that Elastic Agent is running in a Pod, and it spawns one Pod to run filebeat and another Pod to run metricbeat inside of Kubernetes. It would then be Kubernetes' job, not Agent's, to ensure that the Pods stay running. Which, if you think about it, could be a great way to start monitoring a Kubernetes cluster: deploy a single Agent and it grows the filebeats and metricbeats as needed.
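In that Kubernetes variant, Agent would create something like the following Pod through the API and rely on Kubernetes to keep it running (names and image tag are illustrative, not a proposal from this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: agent-managed-filebeat   # hypothetical name chosen by Agent
  labels:
    managed-by: elastic-agent
spec:
  restartPolicy: Always          # Kubernetes, not Agent, keeps it alive
  containers:
    - name: filebeat
      image: docker.elastic.co/beats/filebeat:7.8.0
      args: ["-e"]
```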

@ph
Contributor

ph commented May 12, 2020

@blakerouse This does look like a good plan. Concerning the k8s / docker scenario as proposed, we might not need it now; I am worried about the management of mount points for the pods.

The problem I see with this: we start multiple instances of Metricbeat/Filebeat at the moment, and I don't think that would fit well in systemd. Would that be different installed services? i.e. Filebeat1
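For what it's worth, systemd does have a built-in mechanism for multiple instances of one binary: template units. This is a hypothetical sketch, not something proposed in the thread; a `filebeat@.service` template can be instantiated as `filebeat@1`, `filebeat@2`, and so on:

```ini
# /etc/systemd/system/filebeat@.service (illustrative path and config layout)
[Unit]
Description=Elastic Agent managed filebeat instance %i

[Service]
# %i expands to the instance name: filebeat@1.service uses filebeat-1.yml, etc.
ExecStart=/usr/share/filebeat/filebeat -e -c /etc/filebeat/filebeat-%i.yml
Restart=always

[Install]
WantedBy=multi-user.target
```

Instances are then started as `systemctl start filebeat@1 filebeat@2`.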

including @ruflin and @michalpristas

@ph ph added the discuss Issue needs further discussion. label May 12, 2020
@ph
Contributor

ph commented May 12, 2020

@ruflin Would this align with what you had in mind for autodiscovery?

> Another option could be to have an agent container talk through OCI, docker, or Kubernetes to spawn the containers on the actual platform. So it could be that Elastic Agent is running in a Pod, and it spawns one Pod to run filebeat and another Pod to run metricbeat inside of Kubernetes. It would then be Kubernetes' job, not Agent's, to ensure that the Pods stay running. Which, if you think about it, could be a great way to start monitoring a Kubernetes cluster: deploy a single Agent and it grows the filebeats and metricbeats as needed.

@ruflin
Collaborator

ruflin commented May 13, 2020

I like the overall idea, especially as it means all processes are run the same way. Will the above mean it will not be possible to run more than one instance of the agent on one OS?

I wonder if it is a feature or a bug that the subprocesses keep running if the agent dies. Will they reconnect as soon as the agent is available again?

For Docker / K8s: I don't think it is related to how autodiscovery will work, but I'm probably missing something here. +1 on focusing first on all the other OS / deployment models.

Will this fully replace the "process" model, or can users choose to still use the process model if they do not want to use systemd, for example?

@michalpristas
Contributor

We were discussing with Blake a slow onboarding to services, in a way that e.g. endpoint can be a service and the beats processes ... it should be just a different implementation of the same interface, and with this in mind we can even make it configurable, as you say.
With the services, I fear what ph mentioned, and that's multiple instances. We will probably need multiple-outputs support in beats (not sure if this has been worked on already).

Also, with services we don't need to keep track of processes; we know the state of the service and we can start/stop it when needed. With the gRPC flow flipped, the process will query for configuration periodically, so it should be much simpler regarding the flow.
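The flipped flow could be sketched roughly like this, with `fetchConfig` standing in for the (hypothetical) gRPC call the managed process makes to Agent; only changed configurations are applied:

```go
package main

import (
	"fmt"
	"time"
)

// fetchConfig stands in for the flipped gRPC call where the process
// asks Agent for its current configuration (hypothetical signature).
type fetchConfig func() (string, error)

// pollLoop queries for configuration every interval and records each
// changed config it would apply. It stops after maxPolls iterations
// so this sketch stays runnable; a real loop would run until shutdown.
func pollLoop(fetch fetchConfig, interval time.Duration, maxPolls int) []string {
	var applied []string
	last := ""
	for i := 0; i < maxPolls; i++ {
		cfg, err := fetch()
		if err == nil && cfg != last {
			applied = append(applied, cfg) // apply only on change
			last = cfg
		}
		time.Sleep(interval)
	}
	return applied
}

func main() {
	// Simulated Agent: returns v1 twice, then v2.
	configs := []string{"v1", "v1", "v2"}
	i := 0
	fetch := func() (string, error) { c := configs[i]; i++; return c, nil }
	fmt.Println(pollLoop(fetch, time.Millisecond, 3)) // [v1 v2]
}
```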

@ph
Contributor

ph commented May 13, 2020

@michalpristas @blakerouse Overall that would simplify the problem. I am assuming we will have some bookkeeping to do on the agent side to detect missing processes?

I am assuming that we will monitor the output of `systemctl start XXX` and the logs to see what errors we could have on startup?

Another thing: with that strategy we could also better restrict the users the processes run as.

@ph
Contributor

ph commented May 13, 2020

> I wonder if it is a feature or a bug that the sub process keep running if the agent dies. Will they reconnect as soon as the agent is available again?

Well, when the Agent manages the processes and gets killed, there is a ripple effect and the dependent processes get cleaned up. If we delegate that to the system, this means that we could have zombie processes sending data to Elasticsearch.

This could probably be solved by having a grace period: if I cannot reach the Agent after X amount of time, I should suspend myself and wait for the Agent to come back.

@blakerouse
Contributor Author

After some investigation, it seems that as uniform as this would be for beats and endpoint, it would open a security issue around getting the certificates and unique token to the application. Staying with subprocesses and having endpoint be the unique case is the best scenario for now.

Closing this issue.

@blakerouse blakerouse removed Ingest Management:beta1 Group issues for ingest management beta1 discuss Issue needs further discussion. labels May 19, 2020