feat: Expose detailed JVM and netty memory metrics #61

Merged: 1 commit merged into main from mem-metrics on Oct 6, 2022

Conversation

@njhill (Member) commented Oct 1, 2022

Motivation

To size model-mesh container memory allocations for different workloads, it would be useful to expose more detailed usage metrics.

Container process-level memory usage alone is opaque and may reflect over-allocation of the heap and/or direct buffer pools.

Modifications

  • Cherry-pick some of the prometheus hotspot exporters from https://github.com/prometheus/client_java/tree/main/simpleclient_hotspot/src/main/java/io/prometheus/client/hotspot, and reduce the amount of garbage they produce during collection.
  • Add a netty memory exporter which includes metrics for the amount of OS memory allocated for buffer pools as well as how much of that pool capacity is currently allocated to application buffers (for both heap and direct arenas, although typically only direct should be used); a Java sketch follows this list.
  • Enable these by default when prometheus metrics are enabled, but support selective enablement via a mem_detail=X,Y,Z parameter in the MM_METRICS env var string.
  • Adjust heap/direct memory sizing in start.sh to allow for explicit configuration of the max direct memory size (a shell sketch follows the Result section below).
  • Adjust the unit test to check for these new metrics.
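
As an illustration of the netty exporter bullet, here is a minimal sketch built on Netty's public allocator metrics and the prometheus simpleclient Collector API. It is not the PR's actual class; the class name and metric names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

import io.netty.buffer.PoolArenaMetric;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocatorMetric;
import io.prometheus.client.Collector;
import io.prometheus.client.GaugeMetricFamily;

// Sketch only: reports how much memory netty has reserved from the OS for
// its buffer pools, and how much of that capacity is currently handed out
// to application buffers, for both the direct and heap arenas.
public class NettyMemoryCollector extends Collector {

    @Override
    public List<MetricFamilySamples> collect() {
        PooledByteBufAllocatorMetric m = PooledByteBufAllocator.DEFAULT.metric();

        GaugeMetricFamily reserved = new GaugeMetricFamily(
                "netty_pool_reserved_bytes", // assumed name, not the PR's
                "Memory reserved from the OS for netty buffer pools",
                List.of("area"));
        reserved.addMetric(List.of("direct"), m.usedDirectMemory());
        reserved.addMetric(List.of("heap"), m.usedHeapMemory());

        GaugeMetricFamily active = new GaugeMetricFamily(
                "netty_pool_active_bytes",   // assumed name, not the PR's
                "Pool capacity currently allocated to application buffers",
                List.of("area"));
        active.addMetric(List.of("direct"), sumActiveBytes(m.directArenas()));
        active.addMetric(List.of("heap"), sumActiveBytes(m.heapArenas()));

        List<MetricFamilySamples> mfs = new ArrayList<>();
        mfs.add(reserved);
        mfs.add(active);
        return mfs;
    }

    // Sum the bytes currently allocated to live buffers across all arenas.
    private static long sumActiveBytes(List<PoolArenaMetric> arenas) {
        long total = 0;
        for (PoolArenaMetric arena : arenas) {
            total += arena.numActiveBytes();
        }
        return total;
    }
}
```

Registering it once with `new NettyMemoryCollector().register()` would expose both gauges on the existing prometheus endpoint, so pool reservation can be compared against live buffer usage.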

Result

More detailed memory insight/tuning possible, even in production.
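
For the start.sh sizing bullet, a hedged shell sketch of the idea; the variable names, default value, and jar name here are assumptions rather than the script's real contents:

```bash
#!/bin/bash
# Sketch only: names and defaults are assumed, not taken from start.sh.
HEAP_SIZE="${HEAP_SIZE:-2g}"
JAVA_OPTS="-Xmx${HEAP_SIZE}"

# Allow an explicit max direct memory size to be configured; otherwise the
# JVM default (roughly the maximum heap size) applies.
if [ -n "$MAX_DIRECT_MEM" ]; then
  JAVA_OPTS="$JAVA_OPTS -XX:MaxDirectMemorySize=$MAX_DIRECT_MEM"
fi

exec java $JAVA_OPTS -jar model-mesh.jar
```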

Signed-off-by: Nick Hill <nickhill@us.ibm.com>
@njhill marked this pull request as ready for review October 3, 2022 17:51
@rafvasq self-requested a review October 4, 2022 20:57
@kserve-oss-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: njhill, rafvasq

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rafvasq (Member) commented Oct 6, 2022

/lgtm

@kserve-oss-bot merged commit e415746 into main Oct 6, 2022
@njhill deleted the mem-metrics branch October 13, 2022 18:56
njhill added a commit that referenced this pull request Oct 21, 2022
Motivation

A bug introduced in the recent PR #61 resulted in the netty direct memory pool size not being set correctly.

Modification

Add the missing `$` to a variable reference in a bash `if` condition in start.sh.
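
Sketched with a hypothetical variable name (the real identifier in start.sh may differ), the class of bug and its fix look like this:

```bash
# Buggy: without the $, the test sees the literal string "MAX_DIRECT_MEM",
# which is never empty, so the outcome does not depend on the variable at all.
if [ -n "MAX_DIRECT_MEM" ]; then
  JAVA_OPTS="$JAVA_OPTS -XX:MaxDirectMemorySize=$MAX_DIRECT_MEM"
fi

# Fixed: the $ expands the variable, so the flag is only added when an
# explicit size was actually configured.
if [ -n "$MAX_DIRECT_MEM" ]; then
  JAVA_OPTS="$JAVA_OPTS -XX:MaxDirectMemorySize=$MAX_DIRECT_MEM"
fi
```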

Result

Correct netty direct memory allocation size, avoid OOM crashing.

Signed-off-by: Nick Hill <nickhill@us.ibm.com>
kserve-oss-bot pushed a commit that referenced this pull request Oct 21, 2022
KillianGolds pushed a commit to KillianGolds/modelmesh that referenced this pull request Aug 7, 2024
[pull] main from kserve:main