auditbeat: system/process module backed by quark #42032
This introduces a new provider for the system/process module in linux.

The main motivation is to address some of the limitations of the current implementation. The gosysinfo provider sends state reports by scraping /proc from time to time, so it loses all short lived processes. Some customers also would like to have full telemetry but can't run auditd for various reasons.

As a bonus we get some extra ECS fields that were not available before.

MAIN DIFFERENCES:
* Publishes every process in the system, regardless of lifespan.
* Publishes exec events for an existing process (without a fork).
* Aggregates fork+exec+exit within one event.
* Adds event.exit_code for processes that exited, can't express exit_time in ECS?
* Include the original process.args. sysinfo reports args that were fetched when it parsed /proc, so a userland process can masquerade itself. For the initial /proc scraping we report the current value like sysinfo. We can't get the original value since the kernel overwrites it, if you wanna have fun: https://github.com/systemd/systemd/blob/main/src/basic/argv-util.c#L165
* Adds process.args_count.
* Adds process.interactive and if true, process.tty.char_device.{major,minor}
* Attempts to hash all processes, not just long lived ones.
* Hashing is not rate-limited anymore, but it's cached and refreshed based on metadata. It's a LRU keyed by path and refreshed if the metadata of the file changes, statx(2) if the kernel supports it, stat(2) otherwise.
* No more periodic state reports, only initial batch.
* No more saving the timestamp of the last state-report in disk.
* No more /proc parsing during runtime, only on boot.

MISSING:
* Unify entity id with sessionview.
* Publish metrics from quark.Stats().
* Docs.
* Properly define config options and names.

EXTRA CHANGES:
* Added statx(2) to seccomp_linux so we can properly use CachedHasher.
* Updated quark to 0.3 so we have namespace inode numbers.
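To make the hashing bullet concrete, here is a rough sketch of an LRU keyed by path whose entries are refreshed when file metadata changes. It is not the CachedHasher from this PR: the names (`hashCache`, `entry`, `Sum`, `demo`) are hypothetical, it uses `os.Stat` rather than statx(2), it compares only size and modification time, and SHA-256 is an arbitrary choice.

```go
package main

import (
	"container/list"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"time"
)

// entry is one cached hash plus the metadata it was computed for.
type entry struct {
	path  string
	size  int64
	mtime time.Time
	sum   string
}

// hashCache is a small LRU keyed by path (hypothetical, for illustration).
type hashCache struct {
	max   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // path -> element holding *entry
	reads int                      // how many times a file was really hashed
}

func newHashCache(max int) *hashCache {
	return &hashCache{max: max, order: list.New(), items: map[string]*list.Element{}}
}

// Sum returns the SHA-256 of path, re-reading the file only when its size
// or modification time differ from the cached entry.
func (c *hashCache) Sum(path string) (string, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return "", err
	}
	if el, ok := c.items[path]; ok {
		e := el.Value.(*entry)
		if e.size == fi.Size() && e.mtime.Equal(fi.ModTime()) {
			c.order.MoveToFront(el)
			return e.sum, nil
		}
		c.order.Remove(el) // stale: drop it and rehash below
		delete(c.items, path)
	}
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	c.reads++
	e := &entry{path, fi.Size(), fi.ModTime(), hex.EncodeToString(h.Sum(nil))}
	c.items[path] = c.order.PushFront(e)
	if c.order.Len() > c.max { // evict the least recently used entry
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).path)
	}
	return e.sum, nil
}

// demo hashes a temp file twice, rewrites it, and hashes it again. It reports
// whether the second call was a cache hit and whether the rewrite forced a rehash.
func demo() (hit, rehashed bool, err error) {
	f, err := os.CreateTemp("", "hashcache")
	if err != nil {
		return
	}
	path := f.Name()
	f.Close()
	defer os.Remove(path)
	if err = os.WriteFile(path, []byte("abc"), 0o600); err != nil {
		return
	}
	c := newHashCache(2)
	s1, err := c.Sum(path)
	if err != nil {
		return
	}
	s2, err := c.Sum(path)
	if err != nil {
		return
	}
	hit = s1 == s2 && c.reads == 1
	if err = os.WriteFile(path, []byte("abcd"), 0o600); err != nil {
		return
	}
	s3, err := c.Sum(path)
	rehashed = err == nil && s3 != s1 && c.reads == 2
	return
}

func main() {
	fmt.Println(demo())
}
```

The point of keying on path and validating with metadata is that a busy system execs the same few binaries over and over, so most lookups cost one stat call instead of a full read of the executable.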
Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)
if err := c.HasherConfig.Validate(); err != nil {
	return err
}
if c.Backend != "quark" && c.Backend != "proc" {
We should be consistent with add_session_metadata in terms of backend option names.
Ack, I've changed it to kernel_tracing, like the processor. I was hoping we could discuss the naming, but it makes more sense to stick with the same name for now.
Co-authored-by: Nicholas Berlin <56366649+nicholasberlin@users.noreply.github.com>
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
quark-0.3 pulls testify 1.10, which reveals this bug in filebeat: #34870 (comment)
Force-pushed: 38a70dc to 79c6bae
I'm more or less ready to commit this. Since there were some merges from main after I did the heavy testing, I just want to redo the tests first.
@haesbaert is there any chance you forgot to commit a file? I tried to run it, but it does not accept the quark backend:
{"log.level":"error","@timestamp":"2025-02-20T16:05:54.355+0100","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cmd/instance.handleError","file.name":"instance/beat.go","file.line":1589},"message":"Exiting: 1 error: failed to unpack the system/process config: invalid process.backend 'quark' accessing 'auditbeat.modules.2' (source:'auditbeat-anderson.yml')","service.name":"auditbeat","ecs.version":"1.6.0"}
I had a quick look and it seems the config does not accept the quark backend:
beats/x-pack/auditbeat/module/system/process/config.go
Lines 23 to 33 in d3285d4
// Validate validates the config.
func (c *Config) Validate() error {
	if err := c.HasherConfig.Validate(); err != nil {
		return err
	}
	if c.Backend != "kernel_tracing" && c.Backend != "procfs" {
		return fmt.Errorf("invalid process.backend '%s'", c.Backend)
	}
	return nil
}
this is the config I used:
http:
  enabled: true
auditbeat.modules:
  - module: system
    datasets:
      - process
    process.backend: "quark"
path.home: /tmp/beat
output.file:
  path: /tmp/beat/
  filename: auditbeat-output-file
  rotate_every_kb: 10000
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
logging.level: info
monitoring.enabled: true
logging.metrics:
- enabled:
var userString string
if len(username) > 0 {
	userString = fmt.Sprintf(" by user %v", username)
[Suggestion | question]
Why use %v for a string?
return fmt.Sprintf("Process %v (PID: %d)%v %v",
	name, pid, userString, actionString)
[Suggestion | question]
Why use %v for a string?
Not really my code, this is what the processor was already doing, I just unified the call:
https://github.com/elastic/beats/blob/main/x-pack/auditbeat/module/system/process/gosysinfo_provider.go#L373-L374
I think we have to change the output of all of those, it looks a bit odd, but I'd rather address all that in a future PR that also changes gosysinfo.
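For context on the %v question: for plain string operands, %v and %s produce identical output, so the choice is stylistic rather than a behaviour change. A small sketch under that assumption follows; the function names, the signature, and the sample action text are made up for illustration and are not the helper from this PR.

```go
package main

import "fmt"

// describeV mirrors the shape of the message in the reviewed snippet,
// using %v for the string operands. Name and signature are illustrative.
func describeV(name string, pid int, username, action string) string {
	var userString string
	if len(username) > 0 {
		userString = fmt.Sprintf(" by user %v", username)
	}
	return fmt.Sprintf("Process %v (PID: %d)%v %v", name, pid, userString, action)
}

// describeS is the same function with %s, which states the intent
// (a string is expected) more explicitly.
func describeS(name string, pid int, username, action string) string {
	var userString string
	if len(username) > 0 {
		userString = fmt.Sprintf(" by user %s", username)
	}
	return fmt.Sprintf("Process %s (PID: %d)%s %s", name, pid, userString, action)
}

func main() {
	fmt.Println(describeS("sleep", 42, "root", "started"))
	// Both variants render the same text for string arguments.
	fmt.Println(describeV("sleep", 42, "root", "started") == describeS("sleep", 42, "root", "started"))
}
```

The two only diverge when an operand is not a string: %s on an int, for example, prints a `%!s(int=...)` marker, which is one argument for preferring the stricter verb in a later cleanup.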
Just until we clarify that it runs. As you can see in my comment above, I cannot run it.
The keyword for the backend changed to "kernel_tracing" instead of "quark", so if you change your config it should work.
Correct, it changed to "kernel_tracing". It will try quark on eBPF; if that fails, it will try quark on kprobe; if that also fails, it falls back to gosysinfo.
should work
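The fallback order described here could look roughly like the sketch below. Everything in it is a stand-in: `opener`, `pickBackend`, the backend labels and the `ok`/`unavailable` helpers are hypothetical and do not reflect the quark or beats APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable is a hypothetical error for a backend that cannot start.
var errUnavailable = errors.New("backend unavailable")

// ok and unavailable simulate a backend that opens or fails to open.
func ok() error          { return nil }
func unavailable() error { return errUnavailable }

// opener pairs a backend label with the function that tries to start it.
type opener struct {
	name string
	open func() error
}

// pickBackend tries the candidates in order and returns the name of the
// first one that opens. If none does, it returns all collected errors.
func pickBackend(candidates []opener) (string, error) {
	var errs []error
	for _, c := range candidates {
		if err := c.open(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", c.name, err))
			continue
		}
		return c.name, nil
	}
	return "", errors.Join(errs...)
}

func main() {
	// Simulate a host where eBPF is not usable but kprobes are.
	name, err := pickBackend([]opener{
		{"quark/ebpf", unavailable},
		{"quark/kprobe", ok},
		{"gosysinfo", ok},
	})
	fmt.Println(name, err) // quark/kprobe <nil>
}
```

Collecting the per-backend errors instead of discarding them makes it possible to log why the preferred backends were skipped.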
ah, ok. could you please update the "How to test this PR" section?
zefixed, thanks for testing :)
This is the output of valgrind after ~6h of quark-mon, running together with auditbeat, just to make sure there's nothing wrong in the C side and a reminder to self of how things were.
if err != nil {
	processErr = fmt.Errorf("failed to hash executable %v for PID %v: %w",
		process.Filename, process.Pid, err)
	ms.log.Warn(processErr.Error())
I'm wondering if it should indeed be a warning.
I ran it and got a lot of warnings because it could not hash an executable that is too big:
{"log.level":"warn","@timestamp":"2025-02-20T16:27:21.671+0100","log.logger":"process","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/auditbeat/module/system/process.(*QuarkMetricSet).toEvent","file.name":"process/quark_provider_linux.go","file.line":243},"message":"failed to hash executable /usr/lib/slack/slack for PID 7014: size 185134296 exceeds max file size","service.name":"auditbeat","ecs.version":"1.6.0"}
there are metrics for the process, but with an error:
{"@timestamp":"2025-02-20T15:27:21.671Z","@metadata":{"beat":"auditbeat","type":"_doc","version":"9.1.0"},"process":{"name":"slack","args_count":1,"working_directory":"/home/ainsoph","executable":"/usr/lib/slack/slack","start":"2025-02-18T07:10:02.350Z","interactive":false,"args":["/usr/lib/slack/slack REDACTED"],"pid":7014,"parent":{"pid":6867},"entity_id":"UfvqOXbM/c9hFthu"},"message":"ERROR for PID 7014: failed to hash executable /usr/lib/slack/slack for PID 7014: size 185134296 exceeds max file size","event":{"type":["info"],"action":"existing_process","category":["process"],"kind":"event","module":"system","dataset":"process"},"agent":{"type":"auditbeat","version":"9.1.0","ephemeral_id":"77baa0d8-7b4f-41f3-918e-ce70c0350402","id":"148313e6-6a91-486d-aef7-b9aabd774130","name":"mokona-elastic"},"host":{"ip":["10.60.103.0","fe80::aa75:4504:559b:7cc9","fe80::7c3c:e574:e6d2:26ca","10.80.40.1","fe80::5054:ff:fe08:3f8c","fe80::6c54:c5ff:fe6c:8a1a","fe80::109b:7ff:fe49:d65a","fe80::b8c2:61ff:fe8e:5967","fe80::a0f1:a3ff:fe32:f388","172.21.0.1","172.23.0.1","172.18.0.1","172.19.0.1","fc00:f853:ccd:e793::1","172.17.0.1","fe80::42:2fff:fe2e:32c7","172.22.0.1","172.20.0.1"],"name":"mokona-elastic","mac":["02-42-2F-2E-32-C7","02-42-5B-A8-E4-3C","02-42-63-D2-17-39","02-42-AC-51-3D-F1","02-42-E4-07-FC-F8","02-42-E4-5F-B8-DC","02-42-E5-E1-0F-B9","12-9B-07-49-D6-5A","52-54-00-08-3F-8C","6E-54-C5-6C-8A-1A","7C-21-4A-A3-E2-DD","A2-F1-A3-32-F3-88","A8-4A-63-A7-84-46","BA-C2-61-8E-59-67"],"hostname":"mokona-elastic","architecture":"x86_64","os":{"platform":"pop","version":"22.04 LTS","family":"debian","name":"Pop!_OS","kernel":"6.9.3-76060903-generic","codename":"jammy","type":"linux"},"id":"e9ed195eddb96448172f94386284db62","containerized":false},"user":{"id":1000,"group":{"id":1000,"name":"ainsoph"},"effective":{"id":1000,"group":{"id":1000}},"saved":{"group":{"id":1000},"id":1000},"name":"ainsoph"},"error":{"message":"failed to hash executable /usr/lib/slack/slack for PID 7014: size 185134296 exceeds max file size"},"service":{"type":"system"},"ecs":{"version":"8.0.0"}}
I'm not super familiar with the details of those metrics, nor with the impact of not being able to hash the executable, so I'm just asking if this is indeed the intended behaviour.
My personal opinion is that if failing to hash some executables is expected, this should at most be an info log, and perhaps the error should be given less prominence.
I agree!
I want to revamp the error messages; for now I'm just keeping them really 1:1 with the former backend.
It's pretty annoying to get that. It would be nice if we could warn about it somewhere else, or just suppress it.
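One possible shape for the log-level idea discussed above, purely as a sketch: classify the hashing error and demote the expected "too large" case. `errFileTooLarge`, `errPermission` and `hashLogLevel` are hypothetical; the real hasher may not expose a sentinel error at all, in which case a different check would be needed.

```go
package main

import (
	"errors"
	"fmt"
)

// errFileTooLarge is a hypothetical sentinel for the size-limit case.
var errFileTooLarge = errors.New("exceeds max file size")

// errPermission stands in for any unexpected hashing failure.
var errPermission = errors.New("permission denied")

// hashLogLevel picks a log level for a hashing failure: expected conditions
// are demoted to "info", anything else stays at "warn".
func hashLogLevel(err error) string {
	if errors.Is(err, errFileTooLarge) {
		return "info"
	}
	return "warn"
}

func main() {
	// Wrap the sentinel the way the provider wraps hasher errors with %w.
	tooBig := fmt.Errorf("failed to hash executable %s for PID %d: size %d %w",
		"/usr/lib/slack/slack", 7014, 185134296, errFileTooLarge)
	fmt.Println(hashLogLevel(tooBig))        // info
	fmt.Println(hashLogLevel(errPermission)) // warn
}
```

Because the provider already wraps the underlying error with %w, `errors.Is` would still match after the message is decorated with the path and PID.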
LGTM, I tested and it works :)
I left some comments, but I don't see them as blockers. It's up to you to decide if they're relevant or not
This introduces a new provider for the system/process module in linux.

The main motivation is to address some of the limitations of the current implementation. The gosysinfo provider sends state reports by scraping /proc from time to time, so it loses all short lived processes. Some customers also would like to have full telemetry but can't run auditd for various reasons.

As a bonus we get some extra ECS fields that were not available before.

MAIN DIFFERENCES:
* Publishes every process in the system, regardless of lifespan.
* Publishes exec events for an existing process (without a fork).
* Aggregates fork+exec+exit within one event.
* Adds event.exit_code for processes that exited, can't express exit_time in ECS?
* Include the original process.args. sysinfo reports args that were fetched when it parsed /proc, so a userland process can masquerade itself. For the initial /proc scraping we report the current value like sysinfo. We can't get the original value since the kernel overwrites it, if you wanna have fun: https://github.com/systemd/systemd/blob/main/src/basic/argv-util.c#L165
* Adds process.args_count.
* Adds process.interactive and if true, process.tty.char_device.{major,minor}
* Attempts to hash all processes, not just long lived ones.
* Hashing is not rate-limited anymore, but it's cached and refreshed based on metadata. It's a LRU keyed by path and refreshed if the metadata of the file changes, statx(2) if the kernel supports it, stat(2) otherwise.
* No more periodic state reports, only initial batch.
* No more saving the timestamp of the last state-report in disk.
* No more /proc parsing during runtime, only on boot.

MISSING:
* Unify entity id with sessionview.
* Docs.

EXTRA CHANGES:
* Added statx(2) to seccomp_linux so we can properly use CachedHasher.
* Updated quark to 0.3 so we have namespace inode numbers.
Co-authored-by: Nicholas Berlin <56366649+nicholasberlin@users.noreply.github.com> Co-authored-by: Andrew Kroh <andrew.kroh@elastic.co> (cherry picked from commit ce6156b)
Proposed commit message
This introduces a new provider for the system/process module in linux.
The main motivation is to address some of the limitations of the current implementation. The gosysinfo provider sends state reports by scraping /proc from time to time, so it loses all short lived processes. Some customers also would like to have full telemetry but can't run auditd for various reasons.
As a bonus we get some extra ECS fields that were not available before.
MAIN DIFFERENCES:
MISSING:
Publish metrics from quark.Stats(). Done, but naming and gauges should be discussed.
EXTRA CHANGES:
Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
How to test this PR locally
Run auditbeat on linux with the following configuration:
(edit) process.backend was quark
Related issues
Integrated PRs related to this
List of previous work done to minimize the size of this PR
Screenshots
Non interactive SSH
Below is a shot of a non-interactive ssh session, done with ssh fc39vm /bin/echo hi from quarkio.

It shows the intermediary processes of sshd until we fork the shell and echo. The interesting bit is that we can see a process that forked+exec'd and then execs again: sshd forks+execs mksh, which in turn execs /bin/echo without forking.
Comparison against the sysinfo provider for a long lived process:
Here we run a long sleep and just compare the events against the existing provider on 8.14.3:

On event.type, event.action and others
I've tried to keep things as close as possible to the old provider, but it's really just a suggestion at this point and it's likely we want to change things.
As you can see, expressing things in event.action is not great. I'm open to suggestions; life would be easier if it could be an array. I've tried to compress more states into fewer words. process_changed_image might look a bit weird, but it's less ambiguous than "executed". Again, I'm really open to suggestions here and I have no strong feelings about it.
event.kind is now always event, as there are no more state reports every X seconds. The initial state report at init remains, but it's also event.
On the state of this PR
This doesn't include the documentation bits, I'd like to do this in a subsequent PR once the naming, config and whatnot is decided.
We should unify process.entity_id with sessionviewer, and we can do it in this PR. It's worth noting that the gosysinfo backend calculates things differently as well, so this is no worse than that.
I'm going out on holidays, but I'm taking this PR out of draft so that we can start the discussion and interested parties can test it.