SMART does not separately tag sub-devices under a single device (megaraid) #6284

Closed
mattster98 opened this issue Aug 20, 2019 · 9 comments
Labels: area/smart, bug (unexpected problem or unintended behavior)


@mattster98

mattster98 commented Aug 20, 2019

Relevant telegraf.conf:

devices = [ "/dev/bus/0 -d megaraid,0", "/dev/bus/0 -d megaraid,1" ]

System info:

Telegraf 1.11.4 (git: HEAD d9ca76e)
Linux r820 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-48-generic] (local build)

Steps to reproduce:

  1. Issue #4881 ("SMART does not found device behind Raid-Controller") prevents SMART from automatically including the devices underneath a RAID controller.
  2. Manually list the devices so that they are included.
  3. Since the top-level device (/dev/bus/0) is the same, the values gathered for the two sub-devices get rolled up under the same device tag, making it impossible to report on them separately.

Expected behavior:

The two devices would each get their own tag value so they can be told apart, for example "0-0" and "0-1" for the config snippet above.

Actual behavior:

The only device that shows up is "0", and as best I can tell, the values for both devices are recorded against that device.
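One way the collapse described above could come about, sketched in Python: if the tag is derived from the last path component of each `devices` entry, both entries reduce to "0". The helper names `device_tag_from_basename` and `device_tag_with_subdevice` are hypothetical, and the basename rule is an assumption about the plugin rather than something confirmed in this issue.

```python
import posixpath


def device_tag_from_basename(entry):
    """Assumed behaviour: keep only the last component of the device path."""
    device_path = entry.split(" ")[0]
    return posixpath.basename(device_path)


def device_tag_with_subdevice(entry):
    """Hypothetical alternative: append the index from '-d megaraid,N'."""
    parts = entry.split()
    tag = posixpath.basename(parts[0])
    if "-d" in parts:
        idx = parts.index("-d")
        if idx + 1 < len(parts) and "," in parts[idx + 1]:
            # 'megaraid,1' -> '1'
            tag = f"{tag}-{parts[idx + 1].split(',', 1)[1]}"
    return tag


devices = ["/dev/bus/0 -d megaraid,0", "/dev/bus/0 -d megaraid,1"]
basename_tags = [device_tag_from_basename(d) for d in devices]
distinct_tags = [device_tag_with_subdevice(d) for d in devices]
print(basename_tags)
print(distinct_tags)
```

With the two entries from the config, the first helper yields the same tag twice, while the second yields the "0-0" and "0-1" values suggested under expected behavior.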

Additional info:

I tried this workaround but haven't had any luck. Either my syntax is wrong or it's ignoring the tag specified.

[[inputs.smart]]
   path = "/usr/sbin/smartctl"
   use_sudo = true
   attributes = false
   devices = [ "/dev/nvme0n1 -d nvme" ]

[[inputs.smart]]
   path = "/usr/sbin/smartctl"
   use_sudo = true
   attributes = false
   devices = [ "/dev/bus/0 -d megaraid,0" ]
   [inputs.smart.tags]
      device = "0"

[[inputs.smart]]
   path = "/usr/sbin/smartctl"
   use_sudo = true
   attributes = false
   devices = [ "/dev/bus/0 -d megaraid,1" ]
   [inputs.smart.tags]
      device = "1"
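
A possible reason the `device` override above has no effect, sketched in Python. This assumes that static tags from `[inputs.smart.tags]` only fill in keys the plugin has not already set, which is an assumption about Telegraf's merge order and is not confirmed anywhere in this thread. `apply_static_tags` is a hypothetical helper, not a Telegraf function.

```python
def apply_static_tags(plugin_tags, static_tags):
    """Assumed merge rule: static tags fill gaps, never overwrite plugin tags."""
    merged = dict(plugin_tags)
    for key, value in static_tags.items():
        merged.setdefault(key, value)
    return merged


# Tags as the plugin might emit them for the second megaraid entry.
plugin_tags = {"device": "0", "host": "r820"}

# Reusing the key "device" collides with the plugin's own tag.
collides = apply_static_tags(plugin_tags, {"device": "1"})

# A key the plugin does not use, such as "disk", is added alongside it.
new_key = apply_static_tags(plugin_tags, {"disk": "1"})

print(collides)
print(new_key)
```

Under that assumption, choosing a tag key the plugin does not emit itself is the safer way to label each sub-device.
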
@mattster98
Author

My InfluxQL is weak, but I hacked together a query that I think confirms it's recording both values against the same device tag:

> select "temp_c" from "smart_device" where "device" =~ /^(0)$/ and time >= now() - 2m;
name: smart_device
time                temp_c
----                ------
1566307230000000000 25
1566307230000000000 26
1566307240000000000 25
1566307240000000000 26
1566307250000000000 25
1566307250000000000 26
1566307260000000000 25
1566307260000000000 26
1566307270000000000 26
1566307270000000000 26
1566307280000000000 26
1566307280000000000 26
1566307290000000000 25
1566307290000000000 26
>

@danielnelson
Contributor

Easiest way to check how we are recording the values is with:

telegraf --input-filter smart --test

Once the values hit the database, later data will overwrite earlier data if the measurement+tagset+field+timestamp is the same; you can only have one value for each combination of these. Since there are 2 values for temp_c at the same timestamp above, we know there must be at least one tag that differs.
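
The overwrite rule can be illustrated with a small Python model. This is only a sketch of the uniqueness rule described above, not how the database is implemented; `write_point` is a hypothetical helper, and the `serial_no` values "A" and "B" are placeholders.

```python
def write_point(store, measurement, tags, field, timestamp, value):
    """Store one value per (measurement, tag set, field, timestamp)."""
    key = (measurement, tuple(sorted(tags.items())), field, timestamp)
    store[key] = value  # a later write with the same key replaces the earlier one


ts = 1566307230000000000

# Identical tag sets: the second write overwrites the first.
same_tags = {}
write_point(same_tags, "smart_device", {"device": "0"}, "temp_c", ts, 25)
write_point(same_tags, "smart_device", {"device": "0"}, "temp_c", ts, 26)

# One differing tag: both values survive at the same timestamp.
diff_tags = {}
write_point(diff_tags, "smart_device", {"device": "0", "serial_no": "A"}, "temp_c", ts, 25)
write_point(diff_tags, "smart_device", {"device": "0", "serial_no": "B"}, "temp_c", ts, 26)

print(len(same_tags), len(diff_tags))
```

Seeing both 25 and 26 at one timestamp therefore matches the second case: the points share `device` but differ in some other tag.
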

With the workaround try grouping by disk and device, or for debugging it can be useful to group by '*':

select temp_c from smart_device where time >= now() - 2m group by *;

@danielnelson added the area/smart and bug labels Aug 21, 2019
@mattster98
Author

That's super helpful - thanks! Yes, the serial number and WWN are unique, which explains why there are two separate values for the same timestamp.

@mattster98
Author

Interesting side-effect, again maybe due to bad syntax on my part, but when I use the above config (added attributes=true for the nvme), it just reports the last device in the config file three times rather than reporting three distinct devices. It does add the "disk" tag to two of the three! If I reorder them, the last one is reported 3 times.

Is this a separate bug?

[serial numbers replaced with dashes]

$ sudo telegraf --input-filter smart --test | grep -i temp
2019-08-22T13:58:27Z I! Starting Telegraf 1.11.4
2019-08-22T13:58:27Z I! Using config file: /etc/telegraf/telegraf.conf
> smart_attribute,device=nvme0n1,disk=0,host=r820,id=194,name=Temperature_Celsius,serial_no=-------- raw_value=31i 1566482308000000000
> smart_device,device=nvme0n1,disk=0,host=r820,model=INTEL\ SSDPEDME016T4S,serial_no=---------------- exit_status=0i,health_ok=true,temp_c=31i 1566482308000000000
> smart_attribute,device=nvme0n1,disk=1,host=r820,id=194,name=Temperature_Celsius,serial_no=------------------ raw_value=31i 1566482308000000000
> smart_device,device=nvme0n1,disk=1,host=r820,model=INTEL\ SSDPEDME016T4S,serial_no=----------- exit_status=0i,health_ok=true,temp_c=31i 1566482308000000000
> smart_attribute,device=nvme0n1,host=r820,id=194,name=Temperature_Celsius,serial_no=------------------- raw_value=31i 1566482308000000000
> smart_device,device=nvme0n1,host=r820,model=INTEL\ SSDPEDME016T4S,serial_no=-------------- exit_status=0i,health_ok=true,temp_c=31i 1566482308000000000
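
To compare the tag sets in output like the above without reading them by eye, a minimal Python sketch can split each line into its measurement and tags. `parse_tags` is a hypothetical helper that handles only backslash-escaped spaces and commas, not the full line protocol escaping rules; the sample lines are the three `smart_device` lines from the output above.

```python
import re


def parse_tags(line):
    """Return (measurement, tags) from one line of `--test` output."""
    line = line.lstrip("> ").strip()
    # The series key ends at the first space that is not backslash-escaped.
    series_key = re.split(r"(?<!\\) ", line, maxsplit=1)[0]
    # Measurement and tags are separated by unescaped commas.
    parts = re.split(r"(?<!\\),", series_key)
    tags = {}
    for part in parts[1:]:
        key, _, value = part.partition("=")
        tags[key] = value.replace("\\ ", " ")
    return parts[0], tags


lines = [
    r"> smart_device,device=nvme0n1,disk=0,host=r820,model=INTEL\ SSDPEDME016T4S,serial_no=---------------- exit_status=0i,health_ok=true,temp_c=31i 1566482308000000000",
    r"> smart_device,device=nvme0n1,disk=1,host=r820,model=INTEL\ SSDPEDME016T4S,serial_no=----------- exit_status=0i,health_ok=true,temp_c=31i 1566482308000000000",
    r"> smart_device,device=nvme0n1,host=r820,model=INTEL\ SSDPEDME016T4S,serial_no=-------------- exit_status=0i,health_ok=true,temp_c=31i 1566482308000000000",
]

parsed = [parse_tags(line) for line in lines]
device_values = {tags["device"] for _, tags in parsed}
disk_values = [tags.get("disk") for _, tags in parsed]
print(device_values)
print(disk_values)
```

Collecting the `device` and `disk` values this way makes the symptom explicit: every point carries the same `device`, while `disk` is set on only two of the three.
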

@danielnelson
Contributor

It seems like another bug, though I'm not able to reproduce this on my all-SATA system:

[[inputs.smart]]
  devices = ["/dev/sda"]
  attributes = false
  use_sudo = true

[[inputs.smart]]
  devices = ["/dev/sdb"]
  attributes = false
  use_sudo = true
  [inputs.smart.tags]
    disk = "0"

[[inputs.smart]]
  devices = ["/dev/sdc"]
  attributes = false
  use_sudo = true
  [inputs.smart.tags]
    disk = "1"
> smart_device,capacity=500107862016,device=sda,enabled=Enabled,host=loki,model=Samsung\ SSD\ 850\ EVO\ 500GB,serial_no=S21HNXAGB00873F,wwn=5002538d4075fd17 exit_status=0i,health_ok=true,temp_c=27i,udma_crc_errors=0i 1566536286000000000
> smart_device,capacity=500107862016,device=sdb,disk=0,enabled=Enabled,host=loki,model=Samsung\ SSD\ 850\ EVO\ 500GB,serial_no=S2RANB0J505626W,wwn=5002538d4200c738 exit_status=0i,health_ok=true,temp_c=28i,udma_crc_errors=0i 1566536286000000000
> smart_device,capacity=640135028736,device=sdc,disk=1,enabled=Enabled,host=loki,model=WDC\ WD6400AAKS-00A7B2,serial_no=WD-WMASY7276305,wwn=50014ee0abd36d7e exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=34i,udma_crc_errors=0i 1566536286000000000

@mattster98
Author

mattster98 commented Aug 23, 2019

Interesting... I'm able to reproduce this problem on both of my Dell servers, one running 1.7.2 and the other 1.11.4.

Neither is just plain SATA. One has built-in RAID plus the PCIe SSD, and the other has HBA interfaces to disk arrays and whatnot. Not sure how that would affect the output when the plugin is run in parallel, though.

I'll file a separate bug.

@glinton
Contributor

glinton commented Aug 26, 2019

@mattster98 can you test this with a nightly build? I'm wondering if the cause of this is similar to your other bug.

@reimda
Contributor

reimda commented Aug 8, 2022

Hi @mattster98 is this still a problem for you? Is it still happening with recent releases of telegraf?

@reimda added the "waiting for response" label Aug 8, 2022
@mattster98
Author

This does appear to be breaking out by serial number now. Thanks! Tested with 1.23.3

@telegraf-tiger bot removed the "waiting for response" label Aug 8, 2022