
DPDV-6166: Purge old unused buffers #75

Merged
merged 9 commits into from
Feb 16, 2024

Conversation

@martin-majlis-s1 (Collaborator) commented Feb 14, 2024:

Jira Link: https://sentinelone.atlassian.net/browse/DPDV-6166

🥅 Goal

Free resources that are no longer used. When a new session is encountered, some resources are allocated. They should be freed if no data has been received for that session for some time.

🛠️ Solution

Introduce a new configuration option - PurgeOlderThan - that specifies when the resources related to unused sessions should be cleared.

This PR also introduces new metrics - sessions_opened and sessions_closed - that allow users to monitor how quickly sessions are being created and destroyed.

🏫 Testing

  • All the tests are passing.
  • Run the stress test:
    • cd examples/stress
    • go build -o stress
    • /usr/bin/time -al ./stress --events=100000 2>&1 | tee out-10-100000-0.18.log
    • /usr/bin/time -al ./stress --events=100000 2>&1 | tee out-11-100000-0.17.log
    • I have run it against version 0.17.0 and against the changes in my master.

Raw Data:

log-0.17.0-1708079725718.log
log-0.18.0-1708079597306.log
stress-17-vs-18.ods
Screenshot 2024-02-16 at 12 01 40

🔗 Links

@martin-majlis-s1 (Collaborator, Author) commented:

In the current version - v0.17.0:

86      1707992444      3000    3000    3000    33370896        78643200        1136105 1010052 126053  0.17.0
87      1707992445      3000    3000    3000    33371624        78643200        1136123 1010056 126067  0.17.0
88      1707992446      3000    3000    3000    33372352        78643200        1136141 1010060 126081  0.17.0
89      1707992447      3000    3000    3000    33373080        78643200        1136159 1010064 126095  0.17.0
90      1707992448      3000    3000    3000    33376488        78643200        1136202 1010068 126134  0.17.0
91      1707992449      3000    3000    3000    33465688        78643200        1136251 1010073 126178  0.17.0
92      1707992450      3000    3000    3000    33466416        78643200        1136269 1010077 126192  0.17.0
93      1707992451      3000    3000    3000    33467144        78643200        1136287 1010081 126206  0.17.0
94      1707992452      3000    3000    3000    33467872        78643200        1136305 1010085 126220  0.17.0
95      1707992453      3000    3000    3000    33468600        78643200        1136323 1010089 126234  0.17.0
96      1707992454      3000    3000    3000    33557800        78643200        1136372 1010094 126278  0.17.0

In the fixed version:

86      1707998026      3000    3000    3000    33805128        75661312        1189192 1018029 171163  0.18.0
87      1707998027      3000    3000    3000    33805872        75661312        1189210 1018033 171177  0.18.0
88      1707998028      3000    3000    3000    33806616        75661312        1189228 1018037 171191  0.18.0
89      1707998029      3000    3000    3000    33807360        75661312        1189246 1018041 171205  0.18.0
90      1707998030      3000    3000    3000    33811320        75661312        1189296 1018045 171251  0.18.0
91      1707998031      3000    3000    3000    33813248        75661312        1189330 1018050 171280  0.18.0
92      1707998032      3000    3000    3000    33813992        75661312        1189348 1018054 171294  0.18.0
93      1707998033      3000    3000    3000    33814736        75661312        1189366 1018058 171308  0.18.0
94      1707998034      3000    3000    3000    33815480        75661312        1189384 1018062 171322  0.18.0
95      1707998035      3000    3000    3000    33816224        75661312        1189402 1018066 171336  0.18.0

There is no difference :/

@codecov-commenter commented Feb 15, 2024:

Codecov Report

Attention: 32 lines in your changes are missing coverage. Please review.

Comparison is base (e747bcf) 76.68% compared to head (0e4036f) 75.95%.


@@            Coverage Diff             @@
##             main      #75      +/-   ##
==========================================
- Coverage   76.68%   75.95%   -0.73%     
==========================================
  Files          12       12              
  Lines        1977     2096     +119     
==========================================
+ Hits         1516     1592      +76     
- Misses        380      419      +39     
- Partials       81       85       +4     
| Files | Coverage Δ |
| --- | --- |
| pkg/client/add_events.go | 82.19% <100.00%> (-1.76%) ⬇️ |
| pkg/buffer/buffer.go | 46.18% <0.00%> (-0.34%) ⬇️ |
| pkg/statistics/statistics.go | 81.65% <82.35%> (+0.09%) ⬆️ |
| pkg/buffer_config/buffer_config.go | 51.33% <50.00%> (-0.52%) ⬇️ |
| pkg/client/client.go | 85.13% <82.56%> (-1.72%) ⬇️ |

@tomaz-s1 (Collaborator) commented:

Thanks for handling this.

Overall approach looks good to me (purge inactive sessions which haven't received any data for X amount of time).

A couple of quick questions / comments:

  1. Since the groupBy function (`extractKeyAndUpdateInfo(bundle *add_events.EventBundle, groupBy []string, key string, info add_events.SessionInfo) string`) is deterministic and always returns the same output for the same inputs, this means that even if a session becomes inactive and we purge it from memory, the next time we receive data for the same host + logfile + other group-by attributes, we will re-use the same session id, right?
  2. I wonder what would be a good default value for PurgeOlderThan. I could imagine there could be sessions which receive data infrequently, perhaps as a pathological case, just slightly longer than PurgeOlderThan seconds. In case there are a lot of sessions like that, this internal session purge + create churn may or may not present a problem. Perhaps to be able to make a better decision, we should try to time and understand how long a single session create + session purge cycle takes. If it's indeed very fast and adds very little overhead, we can probably keep it low (e.g. < 5 minutes) by default.
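For illustration, deterministic key derivation of the kind described in point 1 could look roughly like this. `sessionKey` and the attribute names are hypothetical stand-ins for `extractKeyAndUpdateInfo`, not the real implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// sessionKey derives a session id purely from the groupBy attribute
// values, so the same host + logfile always maps back to the same
// session key, even after the session was purged and re-created.
func sessionKey(attrs map[string]string, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, g := range groupBy {
		parts = append(parts, g+"="+attrs[g])
	}
	return strings.Join(parts, "|")
}

func main() {
	attrs := map[string]string{"host": "web-1", "logfile": "/var/log/app.log"}
	groupBy := []string{"host", "logfile"}
	k1 := sessionKey(attrs, groupBy)
	k2 := sessionKey(attrs, groupBy) // same inputs after a purge -> same key
	fmt.Println(k1 == k2)            // true
}
```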

@martin-majlis-s1 (Collaborator, Author) commented:

@tomaz-s1 :
ad 1) Yes, the session id is deterministic; the same key will be used in the future. I should add a test for this to make sure that it keeps working!
ad 2) Since it's just removing a few entries from the hashmap, I have set it to 30s. Do you suggest a larger value?

@martin-majlis-s1 (Collaborator, Author) commented:

v0.17:

          3259777024  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              199611  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
              200000  messages sent
              200003  messages received
               67899  signals received
              224093  voluntary context switches
             4121274  involuntary context switches
        697631126962  instructions retired
        413668860659  cycles elapsed
          3244170880  peak memory footprint

v0.18:

           179142656  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               11170  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
              200000  messages sent
              200002  messages received
               97798  signals received
               85899  voluntary context switches
             3127390  involuntary context switches
        578955892824  instructions retired
        388075215153  cycles elapsed
           171853888  peak memory footprint

Screenshot 2024-02-16 at 12 01 40

@tomaz-s1 (Collaborator) commented:

@martin-majlis-s1 Great, thanks.

Since that default value is end-user configurable, I think it's fine to start with 30s and then adjust if / when needed.

@tomaz-s1 (Collaborator) left a comment:

LGTM, thanks.

@martin-majlis-s1 martin-majlis-s1 added this pull request to the merge queue Feb 16, 2024
Merged via the queue into main with commit 22ff6db Feb 16, 2024
10 checks passed
@martin-majlis-s1 martin-majlis-s1 deleted the DPDV-6166-prune-old-sessions branch February 16, 2024 12:48
jpkrohling pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Feb 19, 2024
…31293)

**Description:** Release unused resources after some time

This PR is:
* upgrading the used library from v0.17.0 to v0.18.0
* introducing new configuration option - `buffer.purge_older_than`

**Link to tracking Issue:** #31292

**Testing:** 

The issue was in the underlying library -
scalyr/dataset-go#75 - where I have fixed it.

![Screenshot 2024-02-16 at 12 01 40](https://github.com/open-telemetry/opentelemetry-collector-contrib/assets/122797378/f1c80807-69de-49c4-aa62-7edd211e3b34)


**Documentation:** I have added documentation to the newly added
configuration option - `buffer.purge_older_than`.

Fixes #31292
XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this pull request Mar 13, 2024