Smart plugging request new metric "smart_device_power_status" #12412
Comments
Can you explain the actual and expected behaviour metrics in influx line format, please? That will make it clearer what exactly you are requesting.
I am not familiar with that format as I don't use InfluxDB, but it should be something like this: The S.M.A.R.T. plugin already exports the data to Prometheus as: Power needs to be its own metric and support the intermediary states, not just ACTIVE or STANDBY.
The example output from the smart plugin is like this (according to the docs)
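I can see a field exit_status and I assume you also want a field power_status? If I can read your Prometheus metric correctly, there should also already be a tag power?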
"I can see a field exit_status and I assume you also want a field power_status?" "If I can read your prometheus metric correctly, there should also already be a tag power?" With the file output plugin, all the smart related info in influx line format: |
Based on this whitepaper the different states are primarily associated with spinning disks. We currently grab the existing power and standby mode by looking at the output here and then use those discovered values here to set the power tag.
Can you share why you think this and why it cannot be another field that parses the power state in more detail? I am hesitant to modify the smart plugin any further given how fragile the regular expression parsing is.
"Based on this whitepaper the different states are primarily associated with spinning disks." In my custom exporter, i get the results from "Device is in " not "Power mode". I am not familiar with the output values of "Power mode". I could look into it if you need. The parsing of "Device is in" & "Power mode" seems already correct, except the code in "smart.go" uses "Power mode" (which may or may not return the intermediary states) and doesn't care about what was parsed only checks that something was parsed. I would guess that if device is in standby, parsing of "Power mode" would return an empty string. "power needs to be it's own metric" The reason i want power status to be it's on metric in what is exposed to Prometheus is that currently, if i use the : In this Grafana dashboard, we can see that the drives transitioning are consider "Active" and on "Standby" during the transition period, which is odd. |
Oh, that last part is just a matter of modifying your query in Grafana, or change the current |
I tried a few things and none worked, but that may be due to my inexperience with Grafana (this included a "transform" with "labels to fields"). When the issue happens, this is what the data scraped from Telegraf by Prometheus looks like:
The device "sdf" appears twice for a few seconds when the device transitions (this happens with any device). This is not the behaviour I would expect or see with other fields. I will try to use the converter processor, but I don't think it will fix that. But this issue is about getting the other power states.
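One possible reading of the double "sdf" entry, offered as an assumption rather than something confirmed in this thread: Prometheus identifies a series by its metric name plus its complete label set, so a sample with power="ACTIVE" and one with power="STANDBY" count as two different series for the same device. The sketch below only illustrates that identity rule. `series_key` and `exposed_series` are hypothetical helpers, and the values 0 and 2 are the exit statuses shown in the issue description.

```python
def series_key(metric_name, labels):
    """Identity of a time series: the metric name plus its complete label set."""
    return (metric_name, frozenset(labels.items()))


def exposed_series(samples):
    """Collapse (name, labels, value) samples into one entry per distinct series."""
    latest = {}
    for name, labels, value in samples:
        latest[series_key(name, labels)] = value
    return latest


# Same device, only the "power" label differs (label values taken from the issue).
samples = [
    ("smart_device_exit_status", {"device": "sdf", "power": "ACTIVE"}, 0),
    ("smart_device_exit_status", {"device": "sdf", "power": "STANDBY"}, 2),
]
series = exposed_series(samples)
print(len(series))  # prints 2: two distinct series for the single device "sdf"
```

Under that assumption, both entries would stay visible until the older one expires on the exporter side, which would match the few seconds of overlap described above.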
About the other power states, please provide us with the smartctl command and its output for such a different state, so someone can implement this. About those 'duplicates', please provide them in influx line format, as the Prometheus format also doesn't give a timestamp. I'm still convinced this is a matter of writing the correct query in Grafana.
The command is "smartctl --nocheck=standby /dev/sda". This other command can also be used: "smartctl -i --nocheck=standby /dev/sda". I recently moved from Windows to Linux and noticed that on Linux, smartmontools never returns the intermediary power states (IDLE_B, etc.), but on Windows it does (on the same drives). On Linux, I use smartmontools 7.2 (release 2020-12-30). On Windows, it was probably 7.3 (release 2022-02-28). I will try to find a way to use the latest version and give you the full output of the commands.
I compiled smartmontools 7.3, used my script and also called smartctl manually, but for some reason I don't get the intermediary power states, only "active or idle" or "standby".
Just to make sure that the code was not specific to Windows, I checked the smartmontools 7.3 sources. For ATA devices, the power mode is requested in file "ataprint.cpp" at line 3337. This string is printed to the console at line 3383 (drive is in standby), 3466 (drive is active and "-n" alone was used) or 3707 (I suspect this line is reached with "-i -n") in the same file :
With "-n standby" you can get only up to "STANDBY" & "STANDBY_Y".
By using the Seagate CLI I can force my drives to go into "idle_b", and it appears as such in Grafana with my Python script that relies on "smartctl" on Linux. So the intermediary states are also reported on Linux; for some reason my drives just never go into IDLE_A & IDLE_B on Linux.
Edit 1: OK, my drives finally go into intermediary states. I use "-n idle" in Telegraf and in my custom Python script.
"sudo smartctl -n idle /dev/sdc" returns: "Device is in IDLE_B mode, exit(2)"
"sudo smartctl -i -n idle /dev/sdc" returns: "Device is in IDLE_B mode, exit(2)"
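For illustration, a minimal Python sketch of how the short "Device is in ... mode, exit(N)" message quoted above could be parsed. The message format is taken only from the two outputs in this comment; `parse_power_mode` is a hypothetical helper, not code from Telegraf or from the custom script mentioned here.

```python
import re

# Matches the short message smartctl prints when -n/--nocheck stops the check,
# e.g. "Device is in IDLE_B mode, exit(2)" (format copied from the comment above).
_DEVICE_IS_IN = re.compile(
    r"Device is in (?P<mode>[A-Za-z_ ]+?) mode, exit\((?P<code>\d+)\)"
)


def parse_power_mode(output):
    """Return (mode, exit_code) found in smartctl output, or (None, None)."""
    for line in output.splitlines():
        match = _DEVICE_IS_IN.search(line)
        if match:
            return match.group("mode").strip().upper(), int(match.group("code"))
    return None, None


print(parse_power_mode("Device is in IDLE_B mode, exit(2)"))  # ('IDLE_B', 2)
```

Returning `(None, None)` when the message is absent keeps "nothing parsed" distinct from any real power state.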
ok, but this is not what telegraf runs. It should run something to the effect of the following (can't recall off hand how it translates
Can you get the full output and see if you find 'Idle_B' in that output?
If Telegraf uses "-n standby" and polls more often than every 10 minutes, the disks will never go into IDLE modes. If Telegraf is disabled and you wait until the disk goes into IDLE mode and then run the command : The drive WAS in IDLE mode and is immediately sent into ACTIVE mode. In order to get the IDLE modes without forcing the drives into active mode, "-n idle" is a requirement. Most of the info is unavailable with the extra parameters, and it falls back to the shortest output. With "-n idle", very little information is available. The power mode is most likely the only info available.
Is that because when telegraf calls smartctl will cause the disks to spin and never enter an idle state?
The purpose of the -n/--nocheck flag is to set what power states smartctl will use in order to prevent smartctl from spinning up the disks.
My conclusion reading your last post is that it is not possible to get these states from the devices you have since telegraf will always cause the disks to spin and as such not let the device go idle? Is that correct?
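As a rough illustration of how the -n/--nocheck levels nest: according to the smartctl man page, "sleep" skips the check only in SLEEP, "standby" also skips it in STANDBY, and "idle" also skips it in IDLE. The sketch below encodes that rule in Python. Filing the sub-states (IDLE_A/B/C, STANDBY_Y) under their parent level is an assumption that merely matches the behaviour reported in this thread, and `skips_check` is a hypothetical helper.

```python
# Deepest power-saving level at which each --nocheck value still skips the check.
NOCHECK_THRESHOLD = {"never": -1, "sleep": 0, "standby": 1, "idle": 2}

# Power states ordered from deepest sleep (0) to fully active (3).
# Assumption: sub-states share the level of their parent state.
POWER_LEVEL = {
    "SLEEP": 0,
    "STANDBY": 1, "STANDBY_Y": 1,
    "IDLE": 2, "IDLE_A": 2, "IDLE_B": 2, "IDLE_C": 2,
    "ACTIVE": 3,
}


def skips_check(nocheck, state):
    """True if smartctl would skip the check (and so not wake the drive)."""
    return POWER_LEVEL[state.upper()] <= NOCHECK_THRESHOLD[nocheck]


print(skips_check("standby", "IDLE_B"))  # False: the drive is queried and wakes up
print(skips_check("idle", "IDLE_B"))     # True: only the power state is reported
```

This is consistent with the observation above that "-n standby" pushes an idle drive back to ACTIVE while "-n idle" leaves it alone.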
"Is that because when telegraf calls smartctl will cause the disks to spin and never enter an idle state ?" "My conclusion reading your last post is that it is not possible to get these states from the devices you have since telegraf will always cause the disks to spin and as such not let the device go idle? Is that correct?" Does that mean that the "nocheck" option in "telegraf.conf" is not used in the command ? |
Your original issue shows you used a nocheck of
The value of |
I can confirm Telegraf correctly parses the power state with option But, if the drive is in active state and Telegraf is running, the drive will not enter Idle_b or Idle_c. When the drive is in active mode, something in Telegraf is preventing it from going into idle mode. Also, the time series for the query "smart_device_exit_status" in Grafana shows a bit of a mess with the device used for the test. NB: I use a Docker container.
Awesome
That is what I would expect. Recall my comment above about smartctl's You have a disk in an active state, smartctl looks and says OK, I can spin the drives to get stats, and so it will. Unless your interval in Telegraf is set to something greater than the time it takes for the device to go back into idle, your disk will never go idle. At this point I think we have shown that Telegraf can in fact report those idle values, and hopefully this explains what is going on with smartctl.
I suspect Telegraf may prevent the drives from going into idle modes with "-n idle" because of the extra attributes in the query ("--health --attributes --tolerance=verypermissive") when the drive is active. My custom script doesn't prevent the active drives from going into idle modes, but its queries are much simpler: "smartctl --nocheck=idle" for the power mode and "smartctl --nocheck=idle -l scttempsts" for the temps. Perhaps it's something else, though. Anyway, it seems it would need more than a few tweaks in the code to make it work in Telegraf. I will close this issue if you are OK with it. Thank you for your time.
Use Case
This is an extension of #9306.
I would like a new metric "smart_device_power_status" which would return the power state of the drives.
Currently, we have basic info (on/off) in the label "power" of the metric "smart_device_exit_status", but I have two issues with that:
As far as I know, smartctl reports "ACTIVE or IDLE", "IDLE_A", "IDLE_B", "IDLE_C" and "STANDBY". I know there are also "Standby_Y" & "Standby_Z", but I don't know if smartctl reports them or just uses "STANDBY" instead.
Some are more useful than others, like "IDLE_B", where the disk parks the heads.
Expected behavior
smart_device_power_status{device="sda",enabled="",host="92889644d4c0",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 5
smart_device_power_status{device="sdb",enabled="",host="92889644d4c0",model="",power="IDLE_B",serial_no="",user="$USER",wwn=""} 2
smart_device_power_status{device="sdc",enabled="",host="92889644d4c0",model="",power="UNKNOWN",serial_no="",user="$USER",wwn=""} -1
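A minimal sketch of the state-to-value mapping implied by the three lines above. Only the values for STANDBY (5), IDLE_B (2) and UNKNOWN (-1) come from this issue; every other number is a placeholder assumption picked to keep the ordering plausible, and `power_status_value` is a hypothetical helper.

```python
# Values for STANDBY, IDLE_B and UNKNOWN come from the expected output above.
# All entries marked "assumed" are placeholders, not part of the request.
POWER_STATUS = {
    "ACTIVE OR IDLE": 0,  # assumed
    "IDLE_A": 1,          # assumed
    "IDLE_B": 2,
    "IDLE_C": 3,          # assumed
    "STANDBY_Y": 4,       # assumed
    "STANDBY": 5,
}
UNKNOWN = -1


def power_status_value(state):
    """Map a power-state string to its gauge value, or -1 when unrecognised."""
    if not state:
        return UNKNOWN
    return POWER_STATUS.get(state.strip().upper(), UNKNOWN)


print(power_status_value("IDLE_B"))   # 2
print(power_status_value("STANDBY"))  # 5
print(power_status_value("UNKNOWN"))  # -1
```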
Actual behavior
smart_device_exit_status{capacity="",device="sde",enabled="",host="92889644d4c0",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="2000398934016",device="sdd",enabled="Enabled",host="92889644d4c0",model="SAMSUNG HD203WI",power="ACTIVE",serial_no="",user="$USER",wwn=""} 0
Additional info
[[inputs.smart]]
use_sudo = true
nocheck = "standby"
devices = [ "hostfs/dev/sda -d ata", "hostfs/dev/sdb -d ata", "hostfs/dev/sdc -d ata", "hostfs/dev/sdd -d ata", "hostfs/dev/sde -d ata", "hostfs/dev/sdf -d ata"]