
[Elastic Agent] Investigate running all beats through the system process manager #18362

Closed
blakerouse opened this issue May 7, 2020 · 9 comments

@blakerouse
Contributor

Currently Elastic Agent runs filebeat and metricbeat as subprocesses, but with the addition of endpoint, Agent needs to manage that application through the system process manager (systemd, services.msc, launchctl).

Instead of making endpoint a one-off case, I propose we look at running them all through the system process manager. There are benefits to doing this:

  • System managers automatically handle restarting and starting on boot
  • If Agent dies, filebeat, metricbeat, endpoint, and more in the future will keep on running, so no events will be lost

In the case of the Elastic Agent container, the first process can be systemd, which spawns Elastic Agent; all the beats would then be managed by that systemd init process.
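As a rough sketch of what "managed by the system process manager" could look like on Linux, Agent could generate a unit along these lines per application (paths and names here are illustrative, not Agent's actual layout):

```ini
[Unit]
Description=Elastic Agent managed filebeat
After=network.target

[Service]
ExecStart=/usr/share/filebeat/filebeat -e
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`Restart=always` and `WantedBy=multi-user.target` are what provide the two benefits above: automatic restarts, and starting on boot.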

@ph ph added the Ingest Management:beta1 Group issues for ingest management beta1 label May 8, 2020
@ph
Contributor

ph commented May 8, 2020

Just added the beta1 label; if we go this route I believe it's the right timing.

@blakerouse
Contributor Author

Yes, I have been looking into this.

It would be a little different on each OS, so we can create an interface with an implementation for each OS build.

Something like below would work.

// Service describes an application to run under the host's service manager.
type Service struct {
    Name    string   // service name, e.g. "filebeat"
    Command string   // path to the executable
    Args    []string // command-line arguments
}

// ServiceManager abstracts the platform-specific service manager.
type ServiceManager interface {
    CreateService(*Service) error
    StartService(*Service) error
    StopService(*Service) error
    DestroyService(*Service) error
}

// One implementation per platform.
type SystemDServiceManager struct{} // Linux (systemd)
type LaunchdServiceManager struct{} // macOS (launchd)
type WindowsServiceManager struct{} // Windows (services.msc)
type DockerServiceManager struct{}  // containers (subprocess-based)

In the DockerServiceManager case I think it's best if we still use subprocesses to run the beats, but that can be encapsulated in the manager. The reason is that systemd inside a container does not work well, because it also assigns cgroups to each process, which is a problem inside a container. It is possible by mounting /sys/fs/cgroup and running the container privileged, but I don't think we want that requirement.
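A SystemDServiceManager along these lines could simply shell out to systemctl. This is a minimal sketch of the idea, not Agent's actual implementation; the `run` hook is injectable purely so the logic can be exercised without a real systemd (CreateService/DestroyService would similarly write and remove the unit file):

```go
package main

import (
	"fmt"
	"os/exec"
)

// Service matches the struct from the interface sketch above.
type Service struct {
	Name    string
	Command string
	Args    []string
}

// SystemDServiceManager drives systemd through systemctl. The run field
// is injectable so the logic can be tested without systemd present.
type SystemDServiceManager struct {
	run func(name string, args ...string) error
}

// NewSystemDServiceManager returns a manager that really invokes systemctl.
func NewSystemDServiceManager() *SystemDServiceManager {
	return &SystemDServiceManager{run: func(name string, args ...string) error {
		return exec.Command(name, args...).Run()
	}}
}

// StartService starts the unit named after the service.
func (m *SystemDServiceManager) StartService(s *Service) error {
	return m.run("systemctl", "start", s.Name+".service")
}

// StopService stops the unit named after the service.
func (m *SystemDServiceManager) StopService(s *Service) error {
	return m.run("systemctl", "stop", s.Name+".service")
}

func main() {
	// Record invocations instead of touching the host's systemd.
	var calls [][]string
	m := &SystemDServiceManager{run: func(name string, args ...string) error {
		calls = append(calls, append([]string{name}, args...))
		return nil
	}}
	_ = m.StartService(&Service{Name: "filebeat"})
	fmt.Println(calls[0]) // [systemctl start filebeat.service]
}
```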

Another option could be to have an agent container talk through OCI, docker, or Kubernetes to spawn the containers on the actual platform. So it could be that Elastic Agent is running in a Pod, and it spawns one Pod to run filebeat and another Pod to run metricbeat inside of Kubernetes. It would then be Kubernetes' job, not Agent's, to ensure that the Pods stay running. Which, if you think about it, could be a great way to start monitoring a Kubernetes cluster: deploy a single Agent and it grows the filebeats and metricbeats as needed.
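In that Kubernetes variant, Agent would create something like the following Pod through the API and rely on Kubernetes to keep it running (names and image tag are illustrative, not a proposal from this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: agent-managed-filebeat   # hypothetical name chosen by Agent
  labels:
    managed-by: elastic-agent
spec:
  restartPolicy: Always          # Kubernetes, not Agent, keeps it alive
  containers:
    - name: filebeat
      image: docker.elastic.co/beats/filebeat:7.8.0
      args: ["-e"]
```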

@ph
Contributor

ph commented May 12, 2020

@blakerouse This does look like a good plan. Concerning the k8s / docker scenario as proposed, we might not need it now; I am worried about the management of mount points for the pods.

The problem I see with this: we start multiple instances of Metricbeat/Filebeat at the moment, and I don't think that would fit well in systemd. Would that be different installed services? i.e. Filebeat1
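For what it's worth, systemd does have a built-in mechanism for multiple instances of one binary: template units. This is a hypothetical sketch, not something proposed in the thread; a `filebeat@.service` template can be instantiated as `filebeat@1`, `filebeat@2`, and so on:

```ini
# /etc/systemd/system/filebeat@.service (illustrative path and config layout)
[Unit]
Description=Elastic Agent managed filebeat instance %i

[Service]
# %i expands to the instance name: filebeat@1.service uses filebeat-1.yml, etc.
ExecStart=/usr/share/filebeat/filebeat -e -c /etc/filebeat/filebeat-%i.yml
Restart=always

[Install]
WantedBy=multi-user.target
```

Instances are then started as `systemctl start filebeat@1 filebeat@2`.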

including @ruflin and @michalpristas

@ph ph added the discuss Issue needs further discussion. label May 12, 2020
@ph
Contributor

ph commented May 12, 2020

@ruflin Would this align with what you had in mind for autodiscovery?

> Another option could be to have an agent container talk through OCI, docker, or Kubernetes to spawn the containers on the actual platform. So it could be that Elastic Agent is running in a Pod, and it spawns one Pod to run filebeat and another Pod to run metricbeat inside of Kubernetes. It would then be Kubernetes' job, not Agent's, to ensure that the Pods stay running. Which, if you think about it, could be a great way to start monitoring a Kubernetes cluster: deploy a single Agent and it grows the filebeats and metricbeats as needed.

@ruflin
Collaborator

ruflin commented May 13, 2020

I like the overall idea, especially as it means all processes are run the same way. Will the above mean it will not be possible to run more than one instance of the agent on one OS?

I wonder if it is a feature or a bug that the subprocesses keep running if the agent dies. Will they reconnect as soon as the agent is available again?

For Docker / K8s: I don't think it is related to how autodiscovery will work, but I'm probably missing something here. +1 on focusing first on all the other OS / deployment models.

Will this fully replace the "process" model, or can users choose to still use the process model if they do not want to use systemd, for example?

@michalpristas
Contributor

We were discussing with Blake a slow onboarding to services, in a way that e.g. endpoint can be a service and the beats processes ... it should be just a different implementation of the same interface, and with this in mind we can even make it configurable, as you say.
With the services, I fear what ph mentioned, and that's multiple instances. We will probably need multiple-outputs support in beats (not sure if this has been worked on already).

Also, with services we don't need to keep track of processes; we know the state of the service and we can start/stop it when needed. With the gRPC flow flipped, the process will query for configuration periodically, so it should be much simpler regarding the flow.
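The flipped flow could be sketched roughly like this, with `fetchConfig` standing in for the (hypothetical) gRPC call the managed process makes to Agent; only changed configurations are applied:

```go
package main

import (
	"fmt"
	"time"
)

// fetchConfig stands in for the flipped gRPC call where the process
// asks Agent for its current configuration (hypothetical signature).
type fetchConfig func() (string, error)

// pollLoop queries for configuration every interval and records each
// changed config it would apply. It stops after maxPolls iterations
// so this sketch stays runnable; a real loop would run until shutdown.
func pollLoop(fetch fetchConfig, interval time.Duration, maxPolls int) []string {
	var applied []string
	last := ""
	for i := 0; i < maxPolls; i++ {
		cfg, err := fetch()
		if err == nil && cfg != last {
			applied = append(applied, cfg) // apply only on change
			last = cfg
		}
		time.Sleep(interval)
	}
	return applied
}

func main() {
	// Simulated Agent: returns v1 twice, then v2.
	configs := []string{"v1", "v1", "v2"}
	i := 0
	fetch := func() (string, error) { c := configs[i]; i++; return c, nil }
	fmt.Println(pollLoop(fetch, time.Millisecond, 3)) // [v1 v2]
}
```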

@ph
Contributor

ph commented May 13, 2020

@michalpristas @blakerouse Overall that would simplify the problem. I am assuming we will have some bookkeeping to do on the agent side to detect missing processes?

I am assuming that we will monitor the output of `systemctl start XXX` and the logs to see what errors we could have on startup?

Another thing: with that strategy we could also better restrict the users the processes run as.

@ph
Contributor

ph commented May 13, 2020

> I wonder if it is a feature or a bug that the sub process keep running if the agent dies. Will they reconnect as soon as the agent is available again?

Well, when the Agent manages the processes and gets killed, there is a ripple effect and the dependent processes get cleaned up. If we delegate that to the system, this means that we could have zombie processes sending data to Elasticsearch.

This could probably be solved by having a grace period: if I cannot reach the Agent after X amount of time, I should suspend myself and wait for the Agent to come back.

@blakerouse
Contributor Author

After some investigation, it seems that as uniform as this would be for beats and endpoint, it would open a security issue around getting the certificates and unique token to the application. Staying with subprocesses and having endpoint be the unique case is the best scenario for now.

Closing this issue.

@blakerouse blakerouse removed Ingest Management:beta1 Group issues for ingest management beta1 discuss Issue needs further discussion. labels May 19, 2020