
psutil version 5.4.7 does not support /proc/diskstats in linux kernels 4.18+ #2707

Closed

vail130 opened this issue Dec 8, 2018 · 5 comments

vail130 commented Dec 8, 2018

Here is the offending line in psutil==5.4.7: https://github.com/giampaolo/psutil/blob/0699c04eb20132dc6f15f69e29565c67d264f00e/psutil/_pslinux.py#L1084

Here is the fixed line in master, which is not in a release yet (newest release right now is 5.4.8): https://github.com/giampaolo/psutil/blob/master/psutil/_pslinux.py#L1079
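For context, a /proc/diskstats line grew from 14 fields (major, minor, device name, plus 11 I/O counters) to 18 fields in kernel 4.18, which added four discard counters (and to 20 fields in kernel 5.5, which added two flush counters). A minimal parsing sketch of the issue, illustrative only and not psutil's actual code:

```python
# Illustrative sketch of the field-count issue; not psutil's actual code.
# /proc/diskstats fields per line (including major, minor, device name):
#   14 -> kernels before 4.18
#   18 -> kernels 4.18+ (four discard counters added)
#   20 -> kernels 5.5+  (two flush counters added)
def parse_diskstats_line(line):
    fields = line.split()
    if len(fields) not in (14, 18, 20):
        # psutil 5.4.7 effectively only accepted the pre-4.18 layouts,
        # hence the ValueError on newer kernels
        raise ValueError("not sure how to interpret line %r" % line)
    name = fields[2]
    reads_completed = int(fields[3])
    writes_completed = int(fields[7])
    return name, reads_completed, writes_completed

# The 18-field line from the traceback below parses fine once 18 is accepted:
line = (" 259       0 nvme0n1 7088 1 569342 20336 20023 36470 "
        "3492694 372963 0 1697310 2068563 0 0 0 0\n")
print(parse_diskstats_line(line))  # -> ('nvme0n1', 7088, 20023)
```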

Output of the info page

Getting the status from the agent.

==============
Agent (v6.7.0)
==============

  Status date: 2018-12-08 13:15:11.801329 UTC
  Pid: 346
  Python Version: 2.7.15
  Logs:
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -1.378ms
    System UTC time: 2018-12-08 13:15:11.801329 UTC

  Host Info
  =========
    bootTime: 2018-12-08 12:44:28.000000 UTC
    kernelVersion: 4.19.0-1.el7.elrepo.x86_64
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: buster/sid
    procs: 72
    uptime: 1m16s
    virtualizationRole: guest
    virtualizationSystem: docker

  Hostnames
  =========
    ec2-hostname: ip-****.ec2.internal
    hostname: gitlab-runner-i-****
    instance-id: i-****
    socket-fqdn: gitlab-autoscaler-utility-****
    socket-hostname: gitlab-autoscaler-utility-****
    hostname provider: configuration

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
        Instance ID: cpu [OK]
        Total Runs: 118
        Metric Samples: 6, Total: 702
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    disk (1.4.0)
    ------------
        Instance ID: disk:9cc7d03419d80578 [ERROR]
        Total Runs: 118
        Metric Samples: 120, Total: 14,160
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 63ms
        Error: not sure how to interpret line ' 259       0 nvme0n1 7088 1 569342 20336 20023 36470 3492694 372963 0 1697310 2068563 0 0 0 0\n'
        Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/base.py", line 366, in run
          self.check(copy.deepcopy(self.instances[0]))
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/disk/disk.py", line 43, in check
          self.collect_metrics_psutil()
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/disk/disk.py", line 148, in collect_metrics_psutil
          self.collect_latency_metrics()
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/disk/disk.py", line 230, in collect_latency_metrics
          for disk_name, disk in psutil.disk_io_counters(True).iteritems():
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/psutil/__init__.py", line 2017, in disk_io_counters
          rawdict = _psplatform.disk_io_counters(**kwargs)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/psutil/_pslinux.py", line 1117, in disk_io_counters
          for entry in gen:
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/psutil/_pslinux.py", line 1090, in read_procfs
          raise ValueError("not sure how to interpret line %r" % line)
      ValueError: not sure how to interpret line ' 259       0 nvme0n1 7088 1 569342 20336 20023 36470 3492694 372963 0 1697310 2068563 0 0 0 0\n'

    docker
    ------
        Instance ID: docker [OK]
        Total Runs: 117
        Metric Samples: 34, Total: 3,978
        Events: 0, Total: 4
        Service Checks: 1, Total: 117
        Average Execution Time : 6ms


    file_handle
    -----------
        Instance ID: file_handle [OK]
        Total Runs: 118
        Metric Samples: 5, Total: 590
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    io
    --
        Instance ID: io [OK]
        Total Runs: 117
        Metric Samples: 26, Total: 3,024
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    load
    ----
        Instance ID: load [OK]
        Total Runs: 118
        Metric Samples: 6, Total: 708
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    memory
    ------
        Instance ID: memory [OK]
        Total Runs: 117
        Metric Samples: 17, Total: 1,989
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    network (1.7.0)
    ---------------
        Instance ID: network:2a218184ebe03606 [OK]
        Total Runs: 118
        Metric Samples: 20, Total: 2,360
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    ntp
    ---
        Instance ID: ntp:b4579e02d1981c12 [OK]
        Total Runs: 118
        Metric Samples: 1, Total: 118
        Events: 0, Total: 0
        Service Checks: 1, Total: 118
        Average Execution Time : 0s


    uptime
    ------
        Instance ID: uptime [OK]
        Total Runs: 118
        Metric Samples: 1, Total: 118
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  CheckRunsV1: 117
  Dropped: 0
  DroppedOnInput: 0
  Events: 0
  HostMetadata: 0
  IntakeV1: 11
  Metadata: 0
  Requeued: 0
  Retried: 0
  RetryQueueSize: 0
  Series: 0
  ServiceChecks: 0
  SketchSeries: 0
  Success: 245
  TimeseriesV1: 117

  API Keys status
  ===============
    API key ending with **** on endpoint https://app.datadoghq.com: API Key valid

=========
DogStatsD
=========

  Checks Metric Sample: 29,868
  Event: 5
  Events Flushed: 5
  Number Of Flushes: 117
  Series Flushed: 30,705
  Service Check: 1,296
  Service Checks Flushed: 1,402
  Dogstatsd Metric Sample: 1,933

Additional environment details (Operating System, Cloud provider, etc):

  • AWS EC2 running the Datadog agent in a Docker container
docker run \
    -e DD_API_KEY=$_dd_api_key \
    -e DD_APM_ENABLED=true \
    -e DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true \
    -e DD_HOSTNAME=$_name-$_instance_id \
    -e DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=false \
    -e DD_LOGS_ENABLED=true \
    -e DD_PROCESS_AGENT_ENABLED=true \
    -e \"DD_TAGS=env:$_environment region:$_region az:$_az app:$_name\" \
    --log-driver=awslogs \
    --log-opt awslogs-group=$_log_group_name \
    --log-opt awslogs-region=$_region \
    --log-opt awslogs-stream=$_instance_id-datadog \
    --name datadog-agent \
    --net host \
    --privileged \
    -v /opt/datadog-agent/conf.d:/conf.d:ro \
    -v /opt/datadog-agent/run:/opt/datadog-agent/run:rw \
    -v /proc/:/host/proc/:ro \
    -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
    -v /var/run/docker.sock:/var/run/docker.sock:ro \
    datadog/agent:$_dd_version

Steps to reproduce the issue:

  1. Upgrade Linux to kernel 4.18+
  2. Use datadog disk integration

Describe the results you received:
The disk integration errors on every run.

Describe the results you expected:
No disk integration errors.

Additional information you deem important (e.g. issue happens only occasionally):

ofek (Contributor) commented Dec 10, 2018

@vail130 Thanks for the heads-up! I've begun watching their repo for releases and when I get a notification I'll update our pinned version straight away.

vail130 (Author) commented Jan 2, 2019

@ofek I was just taking a look at this issue: giampaolo/psutil#1354

Is there any way to update to the latest git version rather than the PyPI release? If not, I understand, but they have fixed the issue in master and seem to be very slow to cut a new release.

Thanks!

aerostitch (Contributor) commented Jan 15, 2019

I'm guessing the only way to get this fixed in the Docker image right now is to build a custom image that runs `sed -i 's#flen == 14:#flen == 14 or flen == 18:#g' /opt/datadog-agent/embedded/lib/python2.7/site-packages/psutil/_pslinux.py` before starting the agent, right?

There's no plan on the Datadog side to fix this any sooner than whenever psutil gets around to releasing the new version containing the patch from over 2 months ago, right?

Right now it's a bit of a problem, as we no longer have disk monitoring...
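For what it's worth, the substitution that sed one-liner performs can be sketched in Python; the source line shown is the kind of field-count check the pattern matches in psutil 5.4.x, and is illustrative rather than an exact quote of the installed file:

```python
# Sketch of what the sed one-liner above does to psutil/_pslinux.py:
# widen the field-count check from 14 fields to 14-or-18 fields.
import re

source_line = "            if flen == 14:"
patched = re.sub(r"flen == 14:", "flen == 14 or flen == 18:", source_line)
print(patched)  # ->             if flen == 14 or flen == 18:
```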

rmoriz commented Jan 18, 2019

Since we have some machines running 4.20, Datadog has started flooding the logs because of this issue. Please cut a release with the updated psutil ASAP. Thanks.

ofek mentioned this issue Jan 23, 2019
KarthikRangaraju commented
@ofek I'm still seeing this issue in the Datadog 7.18.1 Docker image.

    disk (2.7.0)
    ------------
      Instance ID: disk:e5dffb8bef24336f [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
      Total Runs: 17
      Metric Samples: Last Run: 672, Total: 11,424
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 308ms
      Last Execution Date : 2020-04-27 21:20:26.000000 UTC
      Last Successful Execution Date : Never
      Error: not sure how to interpret line '   8       0 sda 45066 9000 10004532 512126 138525639 193915661 3435741544 440462636 0 127530527 368668748 0 0 0 0 0 0\n'
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py", line 713, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/disk/disk.py", line 121, in check
          self.collect_latency_metrics()
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/disk/disk.py", line 244, in collect_latency_metrics
          for disk_name, disk in iteritems(psutil.disk_io_counters(True)):
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/psutil/__init__.py", line 2168, in disk_io_counters
          rawdict = _psplatform.disk_io_counters(**kwargs)
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/psutil/_pslinux.py", line 1125, in disk_io_counters
          for entry in gen:
        File "/opt/datadog-agent/embedded/lib/python3.8/site-packages/psutil/_pslinux.py", line 1098, in read_procfs
          raise ValueError("not sure how to interpret line %r" % line)
      ValueError: not sure how to interpret line '   8       0 sda 45066 9000 10004532 512126 138525639 193915661 3435741544 440462636 0 127530527 368668748 0 0 0 0 0 0\n'
root@9eabbd05b5df:/opt/datadog-agent/embedded/lib/python3.8/site-packages# ls | grep psutil
psutil
psutil-5.6.7.dist-info

It uses psutil 5.6.7, and in your PR #3019 you seem to have pinned it to 5.5.0. Although I'm on a higher version than that, I'm still facing this issue.
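A likely explanation for the recurrence: kernel 5.5 extended /proc/diskstats again, from 18 to 20 fields (two flush counters), and the failing line in the traceback above has 20 fields, which this psutil version evidently does not accept yet. A quick illustrative count:

```python
# The failing line from the 7.18.1 report: kernel 5.5+ added two flush
# counters, bringing a /proc/diskstats line to 20 fields.
line = ("   8       0 sda 45066 9000 10004532 512126 138525639 193915661 "
        "3435741544 440462636 0 127530527 368668748 0 0 0 0 0 0\n")
print(len(line.split()))  # -> 20
```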
