
auditbeat: system/process module backed by quark #42032

Merged
merged 40 commits into from
Feb 20, 2025

Conversation

Contributor

@haesbaert haesbaert commented Dec 13, 2024

Proposed commit message

This introduces a new provider for the system/process module on Linux.

The main motivation is to address some limitations of the current implementation. The gosysinfo provider sends state reports by scraping /proc from time to time, so it misses all short-lived processes. Some customers would also like full telemetry but can't run auditd for various reasons.

As a bonus we get some extra ECS fields that were not available before.

MAIN DIFFERENCES:

  • Publishes every process in the system, regardless of lifespan.
  • Publishes exec events for an existing process (without a fork).
  • Aggregates fork+exec+exit within one event.
  • Adds event.exit_code for processes that exited; exit_time seemingly can't be expressed in ECS.
  • Includes the original process.args. sysinfo reports args that were fetched when it parsed /proc, so a userland process can masquerade itself. For the initial /proc scraping we report the current value, like sysinfo; we can't get the original value since the kernel overwrites it. If you wanna have fun: https://github.com/systemd/systemd/blob/main/src/basic/argv-util.c#L165
  • Adds process.args_count.
  • Adds process.interactive and, if true, process.tty.char_device.{major,minor}.
  • Attempts to hash all processes, not just long lived ones.
  • Hashing is no longer rate-limited; instead it's cached and refreshed based on metadata. The cache is an LRU keyed by path, refreshed when the file's metadata changes: statx(2) if the kernel supports it, stat(2) otherwise.
  • No more periodic state reports, only initial batch.
  • No more saving the timestamp of the last state report on disk.
  • No more /proc parsing during runtime, only on boot.

MISSING:

  • Unify entity id with sessionview.
  • Publish metrics from quark.Stats(). Done, but naming and gauges should be discussed.
  • Docs.
  • Properly define config options and names.

EXTRA CHANGES:

  • Added statx(2) to seccomp_linux so we can properly use CachedHasher.
  • Updated quark to 0.3 so we have namespace inode numbers.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Run auditbeat on linux with the following configuration:

auditbeat.modules:
- module: system
  datasets:
    - process
  process.backend: "kernel_tracing"

(edit) process.backend was quark

Related issues

Integrated PRs related to this

List of previous work done to minimize the size of this PR

Screenshots

Non interactive SSH

Below is a shot of a non-interactive ssh session, done with ssh fc39vm /bin/echo hi from quarkio.
It shows the intermediary processes of sshd until we fork the shell and echo; the interesting bit is that we can see a process that forked+execed and then execs again: sshd forks+execs mksh, which in turn execs /bin/echo, without forking.
[screenshot: ssh_nonint]

Comparison against the sysinfo provider for a long lived process:

Here we run a long sleep and just compare the events against the existing provider on 8.14.3:
[screenshot: vs_old]

On event.type, event.action and others

I've tried to keep things as close as possible to the old provider, but it's really just a suggestion at this point, and it's likely we'll want to change things.

event.type:

  case                  gosysinfo  quark
  fork                  start      start
  fork+exec             start      [start, change]
  short fork+exec+exit  N/A        [start, change, end]
  short fork+exit       N/A        [start, end]
  existing processes    info       info
  exec only             N/A        change
  exec+exit             end        [change, end]

event.action:

  case                  gosysinfo         quark
  fork                  process_started   process_started
  fork+exec             process_started   process_started
  short fork+exec+exit  N/A               process_ran
  short fork+exit       N/A               process_ran
  existing processes    existing_process  existing_process
  exec only             N/A               process_changed_image
  exec+exit             end               process_stopped

As you can see, expressing things in event.action is not great; I'm
open to suggestions, and life would be easier if it could be an
array. I've tried to compress more states into fewer words.
process_changed_image might look a bit weird, but it's less ambiguous
than "executed". Again, really open to suggestions here, and I have no
strong feelings about it.

event.kind is now always event, as there are no more state reports every X seconds.
The initial state report at init remains, but it's also event.
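
Since quark aggregates fork, exec and exit into one event, the event.type column above reduces to flags-to-array logic. A hypothetical sketch (the function name and signature are mine, not the PR's code):

```go
package main

import "fmt"

// eventTypes maps an aggregated quark event (which may combine fork, exec
// and exit) to the ECS event.type array from the table above. "existing"
// covers processes found in the initial /proc snapshot at startup.
func eventTypes(fork, exec, exit, existing bool) []string {
	if existing {
		return []string{"info"}
	}
	var types []string
	if fork {
		types = append(types, "start")
	}
	if exec {
		types = append(types, "change")
	}
	if exit {
		types = append(types, "end")
	}
	return types
}

func main() {
	fmt.Println(eventTypes(true, true, true, false))  // short fork+exec+exit: [start change end]
	fmt.Println(eventTypes(false, true, false, false)) // exec only: [change]
}
```

event.action is harder precisely because it is a single keyword, so the same flag combinations have to be compressed into words like process_ran and process_changed_image.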

On the state of this PR

This doesn't include the documentation bits, I'd like to do this in a subsequent PR once the naming, config and whatnot is decided.
We should unify process.entity_id with sessionviewer, and we can do it in this PR; worth noting that the gosysinfo backend calculates things differently as well, so this is no worse than that.

I'm going out on holidays, but I'm taking this PR out of draft so that we can start the discussion and interested parties can test it.

@haesbaert haesbaert added enhancement Team:Security-Linux Platform Linux Platform Team in Security Solution labels Dec 13, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Dec 13, 2024
Contributor

mergify bot commented Dec 13, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @haesbaert? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch (\d is the digit)

Contributor

mergify bot commented Dec 13, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Dec 13, 2024
@haesbaert haesbaert marked this pull request as ready for review December 13, 2024 11:11
@haesbaert haesbaert requested review from a team as code owners December 13, 2024 11:11
@elasticmachine
Collaborator

Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)

if err := c.HasherConfig.Validate(); err != nil {
	return err
}
if c.Backend != "quark" && c.Backend != "proc" {
Member

We should be consistent with add_session_metadata in terms of backend option names.

Contributor Author

@haesbaert haesbaert Jan 8, 2025

Ack, I've changed it to kernel_tracing like the processor; I was hoping we could discuss the naming, but it makes more sense to stick with the same for now.

Contributor

mergify bot commented Jan 8, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b quark-process upstream/quark-process
git merge upstream/main
git push upstream quark-process

@haesbaert haesbaert marked this pull request as draft January 9, 2025 15:09
haesbaert and others added 3 commits January 10, 2025 08:34
Co-authored-by: Nicholas Berlin <56366649+nicholasberlin@users.noreply.github.com>
Co-authored-by: Nicholas Berlin <56366649+nicholasberlin@users.noreply.github.com>
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Feb 18, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@haesbaert
Contributor Author

quark-0.3 pulls testify 1.10, which reveals this bug in filebeat: #34870 (comment)

@haesbaert haesbaert removed the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Feb 18, 2025
@haesbaert
Contributor Author

I'm more or less ready to commit this; as there were some merges from main since I did the heavy testing, I just want to redo it.
Basically I let it run for a day on fork bombs to check if it behaves.

Member

@AndersonQ AndersonQ left a comment

@haesbaert is there any chance you forgot to commit a file? I tried to run it, but it does not accept the quark backend:

{"log.level":"error","@timestamp":"2025-02-20T16:05:54.355+0100","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cmd/instance.handleError","file.name":"instance/beat.go","file.line":1589},"message":"Exiting: 1 error: failed to unpack the system/process config: invalid process.backend 'quark' accessing 'auditbeat.modules.2' (source:'auditbeat-anderson.yml')","service.name":"auditbeat","ecs.version":"1.6.0"}

I had a quick look and it seems the config does not accept the quark backend:

// Validate validates the config.
func (c *Config) Validate() error {
	if err := c.HasherConfig.Validate(); err != nil {
		return err
	}
	if c.Backend != "kernel_tracing" && c.Backend != "procfs" {
		return fmt.Errorf("invalid process.backend '%s'", c.Backend)
	}
	return nil
}

this is the config I used:

http:
  enabled: true

auditbeat.modules:
- module: system
  datasets:
    - process
  process.backend: "quark"

path.home: /tmp/beat

output.file:
  path: /tmp/beat/
  filename: auditbeat-output-file
  rotate_every_kb: 10000

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~

logging.level: info
monitoring.enabled: true
logging.metrics:
  - enabled:


var userString string
if len(username) > 0 {
	userString = fmt.Sprintf(" by user %v", username)
Member

[Suggestion | question]
Why use %v for a string?

Comment on lines +151 to +152
return fmt.Sprintf("Process %v (PID: %d)%v %v",
	name, pid, userString, actionString)
Member

[Suggestion | question]
Why use %v for a string?

Contributor Author

Not really my code; this is what the processor was already doing, I just unified the call:
https://github.com/elastic/beats/blob/main/x-pack/auditbeat/module/system/process/gosysinfo_provider.go#L373-L374

Contributor Author

I think we have to change the output of all of those; it looks a bit odd, but I'd rather address all that in a future PR that also changes gosysinfo.

Member

@AndersonQ AndersonQ left a comment

Just until we clarify it runs; as you can see in my comment above, I cannot run it.

@haesbaert
Contributor Author

just until we clarify it runs. as you can see in my comment above, I cannot run it

The keyword changed to "kernel_tracing" instead of "quark" for the backend, so if you change your config it should work.

e3b7332
#42032 (comment)

@haesbaert
Contributor Author

haesbaert commented Feb 20, 2025

@haesbaert is there any chance you forgot to commit a file? I tried to run it, but it does not accept the quark backend: [...]
Correct, it changed to "kernel_tracing". It will try quark on EBPF; if that fails, it will try quark on kprobes; if that fails, it falls back to gosysinfo.
So changing to:

process.backend: "kernel_tracing"

should work.

@AndersonQ
Member

How to test this PR locally

Ah, ok. Could you please update the "How to test this PR" section?

@haesbaert
Contributor Author

How to test this PR locally

ah, ok. could you please update the "How to test this PR" section?

zefixed, thanks for testing :)

@haesbaert
Contributor Author

This is the output of valgrind after ~6h of quark-mon, running together with auditbeat, just to make sure there's nothing wrong on the C side, and as a reminder to self of how things were.

==47391== HEAP SUMMARY:
==47391==     in use at exit: 0 bytes in 0 blocks
==47391==   total heap usage: 5,326,567 allocs, 5,326,567 frees, 4,295,175,100 bytes allocated

if err != nil {
	processErr = fmt.Errorf("failed to hash executable %v for PID %v: %w",
		process.Filename, process.Pid, err)
	ms.log.Warn(processErr.Error())
Member

I'm wondering if it should indeed be a warning.
I ran it and got a lot of warnings because it could not hash an executable that is too big:

{"log.level":"warn","@timestamp":"2025-02-20T16:27:21.671+0100","log.logger":"process","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/auditbeat/module/system/process.(*QuarkMetricSet).toEvent","file.name":"process/quark_provider_linux.go","file.line":243},"message":"failed to hash executable /usr/lib/slack/slack for PID 7014: size 185134296 exceeds max file size","service.name":"auditbeat","ecs.version":"1.6.0"}

there are metrics for the process, but with an error:

{"@timestamp":"2025-02-20T15:27:21.671Z","@metadata":{"beat":"auditbeat","type":"_doc","version":"9.1.0"},"process":{"name":"slack","args_count":1,"working_directory":"/home/ainsoph","executable":"/usr/lib/slack/slack","start":"2025-02-18T07:10:02.350Z","interactive":false,"args":["/usr/lib/slack/slack REDACTED"],"pid":7014,"parent":{"pid":6867},"entity_id":"UfvqOXbM/c9hFthu"},"message":"ERROR for PID 7014: failed to hash executable /usr/lib/slack/slack for PID 7014: size 185134296 exceeds max file size","event":{"type":["info"],"action":"existing_process","category":["process"],"kind":"event","module":"system","dataset":"process"},"agent":{"type":"auditbeat","version":"9.1.0","ephemeral_id":"77baa0d8-7b4f-41f3-918e-ce70c0350402","id":"148313e6-6a91-486d-aef7-b9aabd774130","name":"mokona-elastic"},"host":{"ip":["10.60.103.0","fe80::aa75:4504:559b:7cc9","fe80::7c3c:e574:e6d2:26ca","10.80.40.1","fe80::5054:ff:fe08:3f8c","fe80::6c54:c5ff:fe6c:8a1a","fe80::109b:7ff:fe49:d65a","fe80::b8c2:61ff:fe8e:5967","fe80::a0f1:a3ff:fe32:f388","172.21.0.1","172.23.0.1","172.18.0.1","172.19.0.1","fc00:f853:ccd:e793::1","172.17.0.1","fe80::42:2fff:fe2e:32c7","172.22.0.1","172.20.0.1"],"name":"mokona-elastic","mac":["02-42-2F-2E-32-C7","02-42-5B-A8-E4-3C","02-42-63-D2-17-39","02-42-AC-51-3D-F1","02-42-E4-07-FC-F8","02-42-E4-5F-B8-DC","02-42-E5-E1-0F-B9","12-9B-07-49-D6-5A","52-54-00-08-3F-8C","6E-54-C5-6C-8A-1A","7C-21-4A-A3-E2-DD","A2-F1-A3-32-F3-88","A8-4A-63-A7-84-46","BA-C2-61-8E-59-67"],"hostname":"mokona-elastic","architecture":"x86_64","os":{"platform":"pop","version":"22.04 LTS","family":"debian","name":"Pop!_OS","kernel":"6.9.3-76060903-generic","codename":"jammy","type":"linux"},"id":"e9ed195eddb96448172f94386284db62","containerized":false},"user":{"id":1000,"group":{"id":1000,"name":"ainsoph"},"effective":{"id":1000,"group":{"id":1000}},"saved":{"group":{"id":1000},"id":1000},"name":"ainsoph"},"error":{"message":"failed to hash executable /usr/lib/slack/slack for PID 7014: size 185134296 exceeds max file size"},"service":{"type":"system"},"ecs":{"version":"8.0.0"}}

I'm not super familiar with the details of those metrics, nor with the impact of not being able to hash the executable. So I'm just asking if it's indeed the intended behaviour.

My personal opinion is that if it's expected to have issues hashing executables, it should be at most an info log, and perhaps we should give less importance to this error.

Contributor Author

I agree!
I want to revamp the error messages; for now I'm just keeping them really 1:1 with the former backend.
It's pretty annoying to get that; it would be nice if we could warn about it somewhere else, or just suppress it.

Member

@AndersonQ AndersonQ left a comment

LGTM, I tested and it works :)
I left some comments, but I don't see them as blockers. It's up to you to decide whether they're relevant.

@haesbaert haesbaert merged commit ce6156b into main Feb 20, 2025
143 checks passed
@haesbaert haesbaert deleted the quark-process branch February 20, 2025 16:19
mergify bot pushed a commit that referenced this pull request Feb 20, 2025
Co-authored-by: Nicholas Berlin <56366649+nicholasberlin@users.noreply.github.com>
Co-authored-by: Andrew Kroh <andrew.kroh@elastic.co>
(cherry picked from commit ce6156b)