nvidia dash #19074

Merged 4 commits on Nov 19, 2024
12 changes: 6 additions & 6 deletions nvidia_nim/README.md
@@ -2,7 +2,7 @@

## Overview

- This check monitors [Nvidia NIM][1] through the Datadog Agent.
+ This check monitors [NVIDIA NIM][1] through the Datadog Agent.

## Setup

@@ -12,15 +12,15 @@ Follow the instructions below to install and configure this check for an Agent r

### Installation

- The Nvidia NIM check is included in the [Datadog Agent][2] package. No additional installation is needed on your server.
+ The NVIDIA NIM check is included in the [Datadog Agent][2] package. No additional installation is needed on your server.

### Configuration

- Nvidia NIM provides Prometheus metrics indicating request statistics. By default, these metrics are available at http://localhost:8000/metrics. The Datadog Agent can collect the exposed metrics using this integration. Follow the instructions below to configure data collection from any or all of the components.
+ NVIDIA NIM provides Prometheus metrics indicating request statistics. By default, these metrics are available at http://localhost:8000/metrics. The Datadog Agent can collect the exposed metrics using this integration. Follow the instructions below to configure data collection from any or all of the components.

**Note**: This check uses [OpenMetrics][10] for metric collection, which requires Python 3.

- 1. Edit the `nvidia_nim.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your Nvidia NIM performance data. See the [sample nvidia_nim.d/conf.yaml][4] for all available configuration options.
+ 1. Edit the `nvidia_nim.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your NVIDIA NIM performance data. See the [sample nvidia_nim.d/conf.yaml][4] for all available configuration options.

2. [Restart the Agent][5].
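For reference, a minimal `nvidia_nim.d/conf.yaml` for step 1 could look like the sketch below. The endpoint value assumes the default `http://localhost:8000/metrics` address mentioned above; adjust it if your NIM deployment exposes metrics elsewhere.

```yaml
# Minimal sketch of nvidia_nim.d/conf.yaml.
# Assumes the default NIM metrics address; change host/port to match your deployment.
init_config:

instances:
  - openmetrics_endpoint: http://localhost:8000/metrics
```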

@@ -36,11 +36,11 @@ See [metadata.csv][7] for a list of metrics provided by this integration.

### Events

- The Nvidia NIM integration does not include any events.
+ The NVIDIA NIM integration does not include any events.

### Service Checks

- The Nvidia NIM integration does not include any service checks.
+ The NVIDIA NIM integration does not include any service checks.

See [service_checks.json][8] for a list of service checks provided by this integration.

2 changes: 1 addition & 1 deletion nvidia_nim/assets/configuration/spec.yaml
@@ -12,5 +12,5 @@ files:
openmetrics_endpoint.required: true
openmetrics_endpoint.value.example: http://localhost:8000/metrics
openmetrics_endpoint.description: |
- Endpoint exposing the Nvidia NIM's Prometheus metrics. For more information refer to:
+ Endpoint exposing the NVIDIA NIM's Prometheus metrics. For more information refer to:
https://docs.nvidia.com/nim/large-language-models/latest/observability.html
1,162 changes: 1,161 additions & 1 deletion nvidia_nim/assets/dashboards/nvidia_nim_overview.json

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions nvidia_nim/assets/monitors/latency.json
@@ -3,15 +3,15 @@
"created_at": "2024-07-02",
"last_updated_at": "2024-07-02",
"title": "Average Request Latency is High",
"description": "This monitor alerts you if Nvidia NIM request latency is too high. High latency means requests are waiting long to be processed. This results in clients having to wait longer for their requests to complete. It also indicates your Nvidia NIM server is receiving more requests than it can comfortably handle.",
"description": "This monitor alerts you if NVIDIA request latency is too high. High latency means requests are waiting long to be processed. This results in clients having to wait longer for their requests to complete. It also indicates your NVIDIA server is receiving more requests than it can comfortably handle.",
"tags": [
"integration:nvidia-nim"
],
"definition": {
"name": "Average request latency is high",
"type": "query alert",
"query": "sum(last_15m):sum:nvidia_nim.e2e_request_latency.seconds.sum{*}.as_count() / sum:nvidia_nim.e2e_request_latency.seconds.count{*}.as_count() > 0.3",
"message": "The average latency for requests coming into your Nvidia NIM instance is higher than the threshold. This means requests are waiting too long to be processed.",
"message": "The average latency for requests coming into your NVIDIA instance is higher than the threshold. This means requests are waiting too long to be processed.",
"tags": [
"integration:nvidia_nim"
],
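For context, the monitor query above computes the mean end-to-end request latency by dividing the summed latency by the request count over the last 15 minutes. As a rough worked example, if 100 requests took a combined 45 seconds in that window, the average is 0.45 s, which exceeds the 0.3 s threshold and would trigger the alert.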
6 changes: 3 additions & 3 deletions nvidia_nim/assets/service_checks.json
@@ -1,7 +1,7 @@
[
{
"agent_version": "7.61.0",
"integration": "Nvidia NIM",
"integration": "nvidia_nim",
"check": "nvidia_nim.openmetrics.health",
"statuses": [
"ok",
@@ -11,7 +11,7 @@
"host",
"endpoint"
],
"name": "Nvidia NIM OpenMetrics endpoint health",
"description": "Returns `CRITICAL` if the Agent is unable to connect to the Nvidia NIM OpenMetrics endpoint, otherwise returns `OK`."
"name": "NVIDIA NIM OpenMetrics endpoint health",
"description": "Returns `CRITICAL` if the Agent is unable to connect to the NVIDIA NIM OpenMetrics endpoint, otherwise returns `OK`."
}
]
2 changes: 1 addition & 1 deletion nvidia_nim/datadog_checks/nvidia_nim/check.py
@@ -42,7 +42,7 @@ def _submit_version_metadata(self):
}
self.set_metadata('version', version_raw, scheme='semver', part_map=version_parts)
else:
self.log.debug("Invalid Nvidia NIM release format: %s", version)
self.log.debug("Invalid NVIDIA NIM release format: %s", version)

def check(self, instance):
super().check(instance)
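For context on the check.py hunk above, the version metadata submission depends on splitting the raw NIM release string into semver parts before calling `set_metadata`. The sketch below shows one way such a `part_map` could be built; the regex, helper name, and surrounding logic are illustrative assumptions rather than the integration's actual code.

```python
import re

# Illustrative sketch only: build the part_map consumed by
# set_metadata(..., scheme='semver', part_map=...).
# The pattern and helper are assumptions, not the check's real implementation.
SEMVER_PATTERN = re.compile(r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)")

def build_version_parts(version_raw):
    match = SEMVER_PATTERN.match(version_raw)
    if match is None:
        # Caller would log "Invalid NVIDIA NIM release format: %s" and skip submission.
        return None
    return {
        "major": match.group("major"),
        "minor": match.group("minor"),
        "patch": match.group("patch"),
    }

# Example: build_version_parts("1.2.3") returns {"major": "1", "minor": "2", "patch": "3"}
```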
@@ -46,7 +46,7 @@ init_config:
instances:

## @param openmetrics_endpoint - string - required
- ## Endpoint exposing the Nvidia NIM's Prometheus metrics. For more information refer to:
+ ## Endpoint exposing the NVIDIA NIM's Prometheus metrics. For more information refer to:
## https://docs.nvidia.com/nim/large-language-models/latest/observability.html
#
- openmetrics_endpoint: http://localhost:8000/metrics
1 change: 0 additions & 1 deletion nvidia_nim/datadog_checks/nvidia_nim/metrics.py
@@ -27,7 +27,6 @@
'request_prompt_tokens': 'request.prompt_tokens',
'request_success': 'request.success',
'request_failure': 'request.failure',

}

RENAME_LABELS_MAP = {
3 changes: 3 additions & 0 deletions nvidia_nim/manifest.json
@@ -43,6 +43,9 @@
"vllm_nvext.entrypoints.openai.api_server"
]
},
"dashboards": {
"NVIDIA NIM Overview": "assets/dashboards/nvidia_nim_overview.json"
},
"monitors": {
"Average Request Latency is High": "assets/monitors/latency.json"
}