Smart plugging request new metric "smart_device_power_status" #12412

Closed
EcceGratum opened this issue Dec 18, 2022 · 19 comments
Labels
area/smart feature request Requests for new plugin and for new features to existing plugins help wanted Request for community participation, code, contribution plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins

Comments

@EcceGratum

EcceGratum commented Dec 18, 2022

Use Case

This is an extension of #9306.

I would like a new metric "smart_device_power_status" that returns the power state of the drives.
Currently, basic on/off information is available in the "power" label of the "smart_device_exit_status" metric, but I have two issues with that:

  1. It's difficult to build a non-buggy Grafana time series from it
  2. It only reports 2 states "Active" or "Standby" but there are a lot of in-between power states that "smartctl" can report (depends on drive)

As far as I know, smartctl reports "ACTIVE or IDLE", "IDLE_A", "IDLE_B", "IDLE_C", and "STANDBY". I know there are also "Standby_Y" and "Standby_Z", but I don't know whether smartctl reports them or just uses "STANDBY" instead.
Some are more useful than others; in "IDLE_B", for example, the disk parks its heads.

Expected behavior

smart_device_power_status{device="sda",enabled="",host="92889644d4c0",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 5
smart_device_power_status{device="sdb",enabled="",host="92889644d4c0",model="",power="IDLE_B",serial_no="",user="$USER",wwn=""} 2
smart_device_power_status{device="sdc",enabled="",host="92889644d4c0",model="",power="UNKNOWN",serial_no="",user="$USER",wwn=""} -1
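
A minimal sketch of how such a metric value could be derived from the state name. Only STANDBY = 5, IDLE_B = 2 and UNKNOWN = -1 come from the sample lines above; every other number, and the names POWER_STATE_CODES, power_state_code and render_sample, are hypothetical and only illustrate the idea.

```python
# Hypothetical numeric encoding of smartctl power state names.
# Only STANDBY=5, IDLE_B=2 and UNKNOWN=-1 are taken from the samples
# above; the remaining values are assumptions chosen to fit around them.
POWER_STATE_CODES = {
    "ACTIVE or IDLE": 0,
    "IDLE_A": 1,
    "IDLE_B": 2,
    "IDLE_C": 3,
    "STANDBY_Y": 4,
    "STANDBY": 5,
}
UNKNOWN_CODE = -1


def power_state_code(state: str) -> int:
    """Map a power state name to its numeric code, or -1 if unknown."""
    return POWER_STATE_CODES.get(state.strip(), UNKNOWN_CODE)


def render_sample(device: str, host: str, state: str) -> str:
    """Render one Prometheus-style sample line with a reduced label set."""
    code = power_state_code(state)
    label = state.strip() if code != UNKNOWN_CODE else "UNKNOWN"
    return (f'smart_device_power_status{{device="{device}",host="{host}",'
            f'power="{label}"}} {code}')


if __name__ == "__main__":
    print(render_sample("sda", "92889644d4c0", "STANDBY"))
    print(render_sample("sdb", "92889644d4c0", "IDLE_B"))
    print(render_sample("sdc", "92889644d4c0", ""))
```

Keeping the state as a numeric value rather than only as a label means the label set of a drive's series does not have to change when its power state changes.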

Actual behavior

smart_device_exit_status{capacity="",device="sde",enabled="",host="92889644d4c0",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="2000398934016",device="sdd",enabled="Enabled",host="92889644d4c0",model="SAMSUNG HD203WI",power="ACTIVE",serial_no="",user="$USER",wwn=""} 0

Additional info

[[inputs.smart]]
use_sudo = true
nocheck = "standby"
devices = [ "hostfs/dev/sda -d ata", "hostfs/dev/sdb -d ata", "hostfs/dev/sdc -d ata", "hostfs/dev/sdd -d ata", "hostfs/dev/sde -d ata", "hostfs/dev/sdf -d ata"]

@EcceGratum EcceGratum added the feature request Requests for new plugin and for new features to existing plugins label Dec 18, 2022
@EcceGratum EcceGratum changed the title Smart pluging request new metric "smart_device_power_status" Smart plugging request new metric "smart_device_power_status" Dec 18, 2022
@Hipska
Contributor

Hipska commented Dec 21, 2022

Can you explain the actual and expected behaviour metrics in influx line format please? That will make it clearer what exactly you are requesting.

@Hipska Hipska added area/smart plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins labels Dec 21, 2022
@EcceGratum
Author

EcceGratum commented Dec 22, 2022

I am not familiar with that format, as I don't use InfluxDB, but it should be something like this:
smart_device_power_status,harddrive=/dev/sda status=5 1465839830100400200
or
smart_device_power_status,harddrive=/dev/sda status="STANDBY" 1465839830100400200

The S.M.A.R.T. plugin already exports the data to prometheus as:
smart_device_exit_status{capacity="",device="sde",enabled="",host="92889644d4c0",model="",power="STANDBY"

power needs to be its own metric and support the intermediary states, not just ACTIVE or STANDBY.
Something like this:
smart_device_power_status{device="sda",host="92889644d4c0"} 5

@Hipska
Contributor

Hipska commented Dec 22, 2022

The example output from the smart plugin is like this (according to the docs)

smart_device,enabled=Enabled,host=mbpro.local,device=rdisk0,model=APPLE\ SSD\ SM0512F,serial_no=S1K5NYCD964433,wwn=5002538655584d30,capacity=500277790720 udma_crc_errors=0i,exit_status=0i,health_ok=true,read_error_rate=0i,temp_c=40i 1502536854000000000

I can see a field exit_status and I assume you also want a field power_status? If I can read your prometheus metric correctly, there should also already be a tag power? I can't find that in this current example, so it would help if you paste your current output in influx line format (by using file output for example) and also the output of the corresponding smartctl tool as given in the docs.

@EcceGratum
Author

"I can see a field exit_status and I assume you also want a field power_status?"
Yes, "exit_status" is just the returned value when executing the smartctl command.

"If I can read your prometheus metric correctly, there should also already be a tag power?"
Yes, it seems to have been added in #9306, but I think it's more of a workaround to know whether a drive is spun down. It's probably based on the value of "exit_status": if your smartctl command starts with "smartctl --nocheck=standby" and the "exit_status" is 2, the drive is spun down; if it returns 0, it's not. That is better than nothing, but we don't see the intermediary power states.

With the file output plugin, all the smart related info in influx line format:
smart_device,device=sdd,host=745557e0062c,power=STANDBY,user=$USER exit_status=2i 1671988565000000000
smart_device,device=sda,host=745557e0062c,power=STANDBY,user=$USER exit_status=2i 1671988565000000000
smart_device,capacity=500107862016,device=sdf,enabled=Enabled,host=745557e0062c,model=Samsung\ SSD\ 860\ EVO\ 500GB,power=ACTIVE,serial_no=***************,user=$USER,wwn=5002538e497dfcc4 uncorrectable_errors=0i,temp_c=24i,udma_crc_errors=0i,exit_status=0i,health_ok=true,reallocated_sectors_count=0i,wear_leveling_count=96i 1671988565000000000
smart_device,device=sdb,host=745557e0062c,power=STANDBY,user=$USER exit_status=2i 1671988565000000000
smart_device,device=sde,host=745557e0062c,power=STANDBY,user=$USER exit_status=2i 1671988565000000000
smart_device,capacity=4000787030016,device=sdc,enabled=Enabled,host=745557e0062c,model=ST4000VN008-2DR166,power=ACTIVE,serial_no=********,user=$USER,wwn=5000c5009de12d0b reallocated_sectors_count=0i,spin_retry_count=0i,command_timeout=7i,pending_sector_count=0i,uncorrectable_sector_count=0i,health_ok=true,read_error_rate=6300578i,seek_error_rate=5037987897i,end_to_end_error=0i,uncorrectable_errors=0i,temp_c=18i,udma_crc_errors=12i,exit_status=0i 1671988565000000000
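
To cross-check the "power" tag against "exit_status" in lines like the ones above, something along these lines could be used. It is a rough sketch, not a full line protocol parser: the function names are made up for illustration, it assumes every line carries a timestamp, and it does not handle quoted string fields that contain spaces or commas (the lines above have none).

```python
import re

# Split on separators that are not preceded by a backslash escape.
_UNESCAPED_SPACE = re.compile(r"(?<!\\) ")
_UNESCAPED_COMMA = re.compile(r"(?<!\\),")


def parse_smart_line(line: str) -> dict:
    """Parse one smart_device line into measurement, tags, and fields.

    Simplification: assumes "measurement,tags fields timestamp" and no
    quoted string fields containing spaces or commas.
    """
    head, field_part, _timestamp = _UNESCAPED_SPACE.split(line.strip(), maxsplit=2)
    measurement, *tag_items = _UNESCAPED_COMMA.split(head)
    tags = {}
    for item in tag_items:
        key, value = item.split("=", 1)
        tags[key] = value.replace("\\ ", " ")  # unescape spaces in tag values
    fields = dict(item.split("=", 1) for item in field_part.split(","))
    return {"measurement": measurement, "tags": tags, "fields": fields}


def power_summary(line: str) -> tuple:
    """Return (device, power tag or None, exit_status as int) for one line."""
    parsed = parse_smart_line(line)
    exit_status = int(parsed["fields"]["exit_status"].rstrip("i"))
    return parsed["tags"]["device"], parsed["tags"].get("power"), exit_status


if __name__ == "__main__":
    sample = ("smart_device,device=sdd,host=745557e0062c,power=STANDBY,"
              "user=$USER exit_status=2i 1671988565000000000")
    print(power_summary(sample))
```

In the output above, every line tagged power=STANDBY has exit_status=2i and every line tagged power=ACTIVE has exit_status=0i, which is the pairing described in this comment.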

@powersj
Contributor

powersj commented Jan 5, 2023

It only reports 2 states "Active" or "Standby" but there are a lot of in-between power states that "smartctl" can report (depends on drive)

Based on this whitepaper the different states are primarily associated with spinning disks.

We currently grab the existing power and standby mode by looking at the output here and then use those discovered values here to set the power tag.

power needs to be its own metric

Can you share why you think this and why it cannot be another field that parses the power state in more detail?

I am hesitant to modify the smart plugin any further given how fragile the regular expression parsing is.

@powersj powersj added the waiting for response waiting for response from contributor label Jan 5, 2023
@EcceGratum
Author

"Based on this whitepaper the different states are primarily associated with spinning disks."
Yes, these are intermediary states between fully active and fully spun down. Some states indicate head parking and/or a lower drive RPM.

In my custom exporter, I get the results from "Device is in ", not "Power mode". I am not familiar with the output values of "Power mode". I could look into it if you need.

The parsing of "Device is in" and "Power mode" already seems correct, except that the code in "smart.go" uses "Power mode" (which may or may not return the intermediary states) and doesn't care about what was parsed; it only checks that something was parsed. I would guess that if the device is in standby, parsing "Power mode" would return an empty string.

"power needs to be its own metric"
That comment only applies to what is exposed from Telegraf to Prometheus. I am guessing that there is some kind of translation layer.
Inside Telegraf, the "smart_device,device=sdd,host=745557e0062c,power=STANDBY,user=$USER exit_status=2i 1671988565000000000" can be reused to describe the other power states; I have no opinion on the matter.

The reason I want power status to be its own metric in what is exposed to Prometheus is that currently, if I use:
smart_device_exit_status{capacity="",device="sde",enabled="",host="92889644d4c0",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
in a time series in Grafana 9, I get duplicate lines when a transition between power states happens. It seems to happen only on that time series.

In this Grafana dashboard, we can see that the transitioning drives are considered both "Active" and on "Standby" during the transition period, which is odd.
image
After that it's fine, but I get duplicates with no status for the drives that transitioned.
image

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jan 9, 2023
@Hipska
Contributor

Hipska commented Jan 9, 2023

Oh, that last part is just a matter of modifying your query in Grafana, or of changing the current power tag to a field with the converter processor.

@EcceGratum
Author

EcceGratum commented Jan 11, 2023

I tried a few things, including a "labels to fields" transform, but none worked, possibly due to my inexperience with Grafana.

When the issue happens, this is what the scraped data from telegraf to prometheus looks like:

smart_device_exit_status{capacity="",device="sda",enabled="",host="ed101bf7913e",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="",device="sdc",enabled="",host="ed101bf7913e",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="",device="sdd",enabled="",host="ed101bf7913e",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="",device="sde",enabled="",host="ed101bf7913e",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="",device="sdf",enabled="",host="ed101bf7913e",model="",power="STANDBY",serial_no="",user="$USER",wwn=""} 2
smart_device_exit_status{capacity="2000398934016",device="sdf",enabled="Enabled",host="ed101bf7913e",model="SAMSUNG HD203WI",power="ACTIVE",serial_no="S1UYJ1CZ700536",user="$USER",wwn="50024e9003bdf2f2"} 0
smart_device_exit_status{capacity="500107862016",device="sdb",enabled="Enabled",host="ed101bf7913e",model="Samsung SSD 860 EVO 500GB",power="ACTIVE",serial_no="S4XBNF0M714935D",user="$USER",wwn="5002538e497dfcc4"} 0

The device "sdf" appears twice for a few seconds when the device transitions (this happens with any device). That is not the behaviour I would expect, and I don't see it with other fields.
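
One way to spot this in a scrape is to group the samples by device and count distinct label sets, since Prometheus treats each distinct label set as a separate series. The sketch below is only illustrative (the function names are invented here) and handles just the simple `name{labels} value` form shown above.

```python
import re
from collections import defaultdict

# key="value" pairs inside the braces; allows backslash escapes in values.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_labels(sample_line: str) -> dict:
    """Return the label set of one exposition-format sample line."""
    start = sample_line.index("{")
    end = sample_line.rindex("}")
    return dict(LABEL_RE.findall(sample_line[start + 1:end]))


def devices_with_multiple_series(lines) -> dict:
    """Map device -> number of distinct label sets, for devices with > 1."""
    series = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "{" not in line:
            continue  # skip blanks, HELP/TYPE comments, unlabeled samples
        labels = parse_labels(line)
        device = labels.get("device")
        if device is not None:
            series[device].add(tuple(sorted(labels.items())))
    return {dev: len(sets) for dev, sets in series.items() if len(sets) > 1}
```

Run against the scrape above, this would flag "sdf": its two samples differ in capacity, enabled, model, power, serial_no and wwn, so they count as two series rather than one series changing value.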

I will try to use the converter processor, but I don't think it will fix that.

But this issue is about getting the other power states.

@Hipska
Contributor

Hipska commented Jan 11, 2023

About the other power states, please provide us the smart command and the output of such a different state, so someone can implement this.

About those ‘duplicates’, please provide them in influx line format as the Prometheus format also doesn’t give a timestamp. I’m still convinced this is a matter of doing a correct query in Grafana.

@powersj powersj added the waiting for response waiting for response from contributor label Jan 11, 2023
@EcceGratum
Author

EcceGratum commented Jan 16, 2023

The command is "smartctl --nocheck=standby /dev/sda".
The output can be, for example, "Device is in ACTIVE or IDLE mode" or "Device is in IDLE_A mode".

Another command that can be used is "smartctl -i --nocheck=standby /dev/sda".
One of the output lines is "Power mode is: ACTIVE or IDLE", "Power mode was: IDLE_B", "Power mode was: IDLE_A", etc.
I found this information with Google, as I currently can't use smartmontools 7.3.

I recently moved from Windows to Linux and noticed that on Linux, smartmontools never returns the intermediary power states (IDLE_B, etc.), but on Windows it does (on the same drives).

On Linux, I use smartmontools 7.2 (release 2020-12-30). On Windows, it was probably 7.3 (release 2022-02-28).
I suspect the brand / model of the drive may also affect this (Seagate works, WD ???).

I will try to find a way to use the latest version and give you the full output of the commands.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jan 16, 2023
@Hipska Hipska added help wanted Request for community participation, code, contribution waiting for response waiting for response from contributor labels Jan 16, 2023
@EcceGratum
Author

EcceGratum commented Jan 18, 2023

I compiled smartmontools 7.3, used my script, and also called smartctl manually, but for some reason I don't get the intermediary power states, only "active or idle" or "standby".

To make sure that the code was not specific to Windows, I checked the smartmontools 7.3 sources.

For ATA devices, the power mode is requested in the file "ataprint.cpp" at line 3337.
The returned int value goes into a switch that selects the proper string for the power mode.
Here is the list supported for ATA devices (there is a separate file for SCSI devices):
"SLEEP", "STANDBY", "STANDBY_Y", "IDLE", "IDLE_A", "IDLE_B", "IDLE_C", "ACTIVE_NV_DOWN", "ACTIVE_NV_UP", "ACTIVE or IDLE"

This string is printed to the console at line 3383 (drive is in standby), 3466 (drive is active and "-n" alone was used) or 3707 (I suspect this line corresponds to "-i -n") in the same file:
l3383: jinf("Device is in %s mode, exit(%d)\n", powername, options.powerexit);
l3466: pout("Device is in %s mode\n", powername);
l3707: pout("Power mode %s %s\n", (powerchg?"was:":"is: "), powername);

With "-n standby", you can only get as far as "STANDBY" and "STANDBY_Y".
With "-n idle", you can get as far as "IDLE", "IDLE_A", "IDLE_B", and "IDLE_C".
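
The three print formats quoted above can be matched with two regular expressions. This is a minimal sketch of that idea, not the regular expressions that smart.go actually uses; parse_power_state and the pattern names are made up for illustration.

```python
import re
from typing import Optional

# Matches "Device is in IDLE_B mode" and "Device is in STANDBY mode, exit(2)".
DEVICE_IS_IN_RE = re.compile(
    r"^Device is in (?P<state>.+?) mode(?:, exit\(\d+\))?\s*$"
)
# Matches "Power mode is:  ACTIVE or IDLE" and "Power mode was: IDLE_B".
POWER_MODE_RE = re.compile(r"^Power mode (?:is|was):\s*(?P<state>.+?)\s*$")


def parse_power_state(output: str) -> Optional[str]:
    """Return the first power state name found in smartctl output, if any."""
    for line in output.splitlines():
        line = line.strip()
        for pattern in (DEVICE_IS_IN_RE, POWER_MODE_RE):
            match = pattern.match(line)
            if match:
                return match.group("state")
    return None


if __name__ == "__main__":
    print(parse_power_state("Device is in IDLE_B mode, exit(2)"))
    print(parse_power_state("Power mode was: IDLE_A"))
```

Capturing the state name itself, rather than only checking that a line matched, is what would let the intermediary states through.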

By using the Seagate CLI, I can force my drives to go into "idle_b", and it appears as such in Grafana with my Python scripts that rely on "smartctl" on Linux. So the intermediary states are also reported on Linux; for some reason, my drives just never go into IDLE_A or IDLE_B on Linux.

Edit 1: OK, my drives finally go into intermediary states. I use "-n idle" in Telegraf and my custom Python script.

"sudo smartctl -n idle /dev/sdc" returns :
"smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-58-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Device is in IDLE_B mode, exit(2)"

"sudo smartctl -i -n idle /dev/sdc" returns :
"smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-58-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Device is in IDLE_B mode, exit(2)"

image

image

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jan 18, 2023
@powersj
Contributor

powersj commented Jan 18, 2023

sudo smartctl -i -n idle /dev/sdc

OK, but this is not what Telegraf runs. It should run something to the effect of the following (I can't recall offhand how it translates hostfs/dev/sde -d ata, or whether it uses it raw):

sudo smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sde

Can you get the full output and see if you find 'Idle_B' in that output?

@powersj powersj added the waiting for response waiting for response from contributor label Jan 18, 2023
@EcceGratum
Author

EcceGratum commented Jan 18, 2023

If Telegraf uses "-n standby" and polls every < 10mins, the disks will never go into IDLE modes.

If Telegraf is disabled and you wait until the disk goes into IDLE mode and then run the command:
"sudo smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sda"

you get this :
image

The drive WAS in IDLE mode and is immediately sent into ACTIVE mode.

In order to get the IDLE modes without forcing the drives into active mode, "-n idle" is a requirement.
Running this command while the drive is in IDLE mode:
"sudo smartctl --info --health --attributes --tolerance=verypermissive -n idle --format=brief /dev/sda"

image

Most of the info is unavailable with the extra parameters, and it falls back to the shortest output.
The drives remain in IDLE mode.

With "-n idle", very little information is available. The power mode is most likely the only info available.
Since there was a "nocheck" option in the config file, I assumed Telegraf already did some reading with it set to "idle".
With your current implementation, it seems like a chore to add this.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jan 18, 2023
@powersj
Contributor

powersj commented Jan 19, 2023

If Telegraf uses "-n standby" and polls every < 10mins, the disks will never go into IDLE modes.

Is that because telegraf calling smartctl causes the disks to spin, so they never enter an idle state?

In order to get the IDLE modes without forcing the drives into active mode, "-n idle" is a requirement.

The purpose of the -n/--nocheck flag is to set the power states in which smartctl skips its check, so that it does not spin up the disks.

  • sleep - check the device, which will cause the disks to spin, but skip if the device is in the sleep state
  • standby - the same, but skip if the device is in the sleep or standby state
  • idle - the same, but skip if the device is in the sleep, standby, or idle state
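
The three rules above can be written down as a small lookup. This is a sketch of the documented behaviour, not smartctl's code: the names are invented here, and treating IDLE_A/IDLE_B/IDLE_C as "idle", STANDBY_Y as "standby", and "ACTIVE or IDLE" as active is an assumption based on what was reported earlier in this thread.

```python
# For each -n/--nocheck value, the state families in which the check is skipped.
SKIPPED_FAMILIES = {
    "never": set(),
    "sleep": {"SLEEP"},
    "standby": {"SLEEP", "STANDBY"},
    "idle": {"SLEEP", "STANDBY", "IDLE"},
}


def state_family(state: str) -> str:
    """Collapse a reported state into SLEEP, STANDBY, IDLE or ACTIVE.

    Assumption: sub-states such as IDLE_B or STANDBY_Y belong to the
    family named before the underscore, and "ACTIVE or IDLE" is active.
    """
    state = state.strip().upper()
    if state == "ACTIVE OR IDLE":
        return "ACTIVE"
    return state.split("_", 1)[0]


def is_check_skipped(nocheck: str, state: str) -> bool:
    """True if smartctl would skip the device instead of querying it."""
    return state_family(state) in SKIPPED_FAMILIES[nocheck]


if __name__ == "__main__":
    for nocheck in ("standby", "idle"):
        print(nocheck, "IDLE_B ->", is_check_skipped(nocheck, "IDLE_B"))
```

Under these assumptions a drive in IDLE_B is still queried with "-n standby" but left alone with "-n idle", which lines up with the behaviour reported in the previous comment.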

My conclusion reading your last post is that it is not possible to get these states from the devices you have since telegraf will always cause the disks to spin and as such not let the device go idle? Is that correct?

@powersj powersj added the waiting for response waiting for response from contributor label Jan 19, 2023
@EcceGratum
Author

EcceGratum commented Jan 19, 2023

"Is that because telegraf calling smartctl causes the disks to spin, so they never enter an idle state?"
Yes with "-n standby".
No with "-n idle".

"My conclusion reading your last post is that it is not possible to get these states from the devices you have since telegraf will always cause the disks to spin and as such not let the device go idle? Is that correct?"
If Telegraf uses "-n standby", yes, the drives will never go into the intermediary idle states.

Does that mean that the "nocheck" option in "telegraf.conf" is not used in the command?

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jan 19, 2023
@powersj
Contributor

powersj commented Jan 19, 2023

Your original issue shows you used a nocheck of standby. I assume you have tried with idle? What output do you get from that?

Does that mean that the "nocheck" option in "telegraf.conf" is not used in the command?

The value of nocheck provided by the user is set and used here.

@powersj powersj added the waiting for response waiting for response from contributor label Jan 19, 2023
@EcceGratum
Author

EcceGratum commented Jan 19, 2023

I can confirm Telegraf correctly parses the power state with option nocheck set to idle.

But, if the drive is in active state and Telegraf is running, the drive will not enter Idle_b or Idle_c.
If the drive is already in Idle_b / idle_c and Telegraf is started, the drive will remain in its idle state and the power state will be correctly parsed.

When the drive is in active mode, something in Telegraf is preventing it from going into idle mode.

Also, the time series with the query "smart_device_exit_status" in Grafana shows a bit of a mess for sdc, the device used for the test, but we can see that the power mode is correctly parsed.

image

NB : I use a docker container.
devices = [ "hostfs/dev/sda -d ata", "hostfs/dev/sdb -d ata", "hostfs/dev/sdc -d ata", "hostfs/dev/sdd -d ata", "hostfs/dev/sde -d ata", "hostfs/dev/sdf -d ata", "hostfs/dev/sdg -d ata"]

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jan 19, 2023
@powersj
Contributor

powersj commented Jan 20, 2023

I can confirm Telegraf correctly parses the power state with option nocheck set to idle.

Awesome

But, if the drive is in active state and Telegraf is running, the drive will not enter Idle_b or Idle_c.

That is what I would expect. Recall my comment above about smartctl's nocheck option. It sets the power states in which smartctl will not spin up the drives. If nocheck is set to 'idle', then smartctl will only check the drive when it is in the active state. In all other states it will not spin up the disk.

You have a disk in an active state; smartctl looks and says "OK, I can spin the drives to get stats", and so it will. Unless your interval in telegraf is set to something greater than the time it takes for the device to go back into idle, your disk will never go idle.
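
The interval argument can be illustrated with a toy timeline. This is purely a model of the explanation above, under the assumption that every poll that finds the drive active restarts its idle timer; whether that holds for every kind of smartctl query is not established here. The function name and parameters are invented for the example.

```python
from typing import Optional


def first_idle_time(poll_interval_s: float, idle_timeout_s: float,
                    horizon_s: float) -> Optional[float]:
    """Return when the drive first goes idle, or None within the horizon.

    Assumed model (not measured): the drive goes idle after
    idle_timeout_s without activity, and every poll that finds it
    active counts as activity and restarts that timer.
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    last_activity = 0.0
    next_poll = poll_interval_s
    while last_activity + idle_timeout_s <= horizon_s:
        idle_at = last_activity + idle_timeout_s
        if idle_at < next_poll:
            return idle_at  # the timer expired before the next poll
        last_activity = next_poll  # the poll restarted the idle timer
        next_poll += poll_interval_s
    return None


if __name__ == "__main__":
    # Poll every 10 s with a 10 min idle timer, observed for one hour.
    print(first_idle_time(10, 600, 3600))
    # Poll every 15 min with the same idle timer.
    print(first_idle_time(900, 600, 3600))
```

In this model the drive only reaches idle when the polling interval is longer than the idle timer; once it is idle, "-n idle" polls skip it and leave it there.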

At this point I think we have shown that telegraf can in fact report those idle values and hopefully this explains what is going on with smartctl.

@powersj powersj added the waiting for response waiting for response from contributor label Jan 20, 2023
@EcceGratum
Author

EcceGratum commented Jan 20, 2023

I suspect Telegraf may prevent the drives from going into idle modes with "-n idle" because of the extra attributes in the query ("--health --attributes --tolerance=verypermissive") when the drive is active. My custom script doesn't prevent the active drives from going into idle modes, but its queries are much simpler: "smartctl --nocheck=idle" for the power mode and "smartctl --nocheck=idle -l scttempsts" for the temperatures. Perhaps it's something else, though.

Anyway, it seems it would need more than a few tweaks in the code to make it work in Telegraf. I will close this issue if you are OK with it.

Thank you for your time.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jan 20, 2023
@powersj powersj closed this as not planned Won't fix, can't repro, duplicate, stale Jan 23, 2023

3 participants