zpool status always returns true #4670

Closed
jimsalterjrs opened this issue May 19, 2016 · 10 comments

@jimsalterjrs

jimsalterjrs commented May 19, 2016

zpool status returns true (exit status 0), even on a completely faulted pool. This means that automated monitors and alerts are forced to rely on parsing text output (which can change without warning) to evaluate pool health.

root@locutus:/data/test# zpool status test && echo $?
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Thu May 19 16:30:13 2016
config:

    NAME        STATE     READ WRITE CKSUM
    test        DEGRADED     0     0     0
      mirror-0  DEGRADED     0     0     0
        nbd0    UNAVAIL      0     0     0  corrupted data
        nbd1    ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        nbd2    ONLINE       0     0     0
        nbd3    ONLINE       0     0     0

errors: No known data errors
0

In the above example, a DEGRADED pool's status still returns 0.

root@banshee:~# zpool status -x test ; echo $?
  pool: test
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: scrub in progress since Thu May 19 16:45:22 2016
    187M scanned out of 187M at 2.13M/s, 0h0m to go
    0 repaired, 99.87% done
config:

    NAME        STATE     READ WRITE CKSUM
    test        UNAVAIL      0     0     0  insufficient replicas
      mirror-0  UNAVAIL      0     0     0  insufficient replicas
        nbd0    UNAVAIL      0     0     0  corrupted data
        nbd1    UNAVAIL      0     0     0  corrupted data
      mirror-1  ONLINE       0     0     0
        nbd2    ONLINE       0     0     0
        nbd3    ONLINE       0     0     0

errors: 1539 data errors, use '-v' for a list
0

And here we see even a completely UNAVAIL pool still returning 0 from a status check.

I would really, really like to see zpool status returning a parseable exit code. An additional option for text output in a stable format designed for machine parsing (as well as predictable exit codes) would be even better.
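
For illustration, the kind of text-scraping check this currently forces might look roughly like the following sketch (the pool name test and the Nagios-style exit codes are only placeholders):

#!/bin/sh
# Sketch only: grep the human-readable output for a bad pool state.
# Deliberately fragile - this wording is not a stable interface.
pool="test"
if zpool status "$pool" | grep -qE 'state: *(DEGRADED|FAULTED|UNAVAIL)'; then
    echo "CRITICAL: pool $pool is not healthy" >&2
    exit 2    # Nagios-style CRITICAL
fi
exit 0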

@loli10K
Contributor

loli10K commented May 20, 2016

Out of curiosity I tried to reproduce the same problem on illumos: its implementation of zpool status seems to behave the same way ZoL does:

[root@smartos]# zpool status -x dozer && echo $?
  pool: dozer
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 20 10:38:59 2016
config:

        NAME                      STATE     READ WRITE CKSUM
        dozer                     UNAVAIL      0     0     0  insufficient replicas
          mirror-0                UNAVAIL      0     0     0  insufficient replicas
            9499654404053574007   UNAVAIL      0     0     0  was /dev/dsk/c4t0d0s0
            c3t0d0                REMOVED      0     0     0
          mirror-1                DEGRADED     0     0     0
            c2t0d0                ONLINE       0     0     0
            15981958288646757095  FAULTED      0     0     0  was /dev/dsk/c5t0d0s0

errors: No known data errors
0

This means that automated monitors and alerts are forced to rely on parsing text (which can change without warning) to evaluate pool health.

The latest FreeNAS release does exactly that: https://github.com/freenas/freenas/blob/FN-9.10-RELEASE/gui/middleware/notifier.py#L5660

I don't know if we could just implement this and call it a day; maybe it should be discussed with the other OpenZFS members.

@behlendorf
Contributor

behlendorf commented May 20, 2016

This is actually by design. The exit code indicates that the command returned without error, not that the pool is healthy. You're going to need to parse the output to determine the status; this should get easier when the JSON support in #3938 is finalized. We could consider adding another command line option to change this behavior, or a new sub-command. But I'd rather not change the expected, long-standing default behavior.

@jimsalterjrs
Author

The problem I have with this is that the output format changes without warning. In fact this has already happened once - last year I woke up to 100 bogus critical pool health warnings from my nagios network because of a change in the text of zpool status after an automatic upgrade. :-\

(Sent from my phone - please blame any weird errors on autocorrect)

@behlendorf
Contributor

Then let's add a reliable interface. That could be the JSON output, which is structured and won't change, or something else.

@drescherjm

I would like this for the same reason. I also use nagios to monitor my servers.

@richardelling
Contributor

Try: zpool get health

@rlaager
Member

rlaager commented May 23, 2016

If you have the pool name, this is even simpler: zpool list -H -o health pool
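
As a usage sketch (pool names passed as arguments, exit codes chosen for a Nagios-style check), that could be wrapped like this:

#!/bin/sh
# Sketch: exit non-zero unless every named pool reports ONLINE.
for pool in "$@"; do
    health=$(zpool list -H -o health "$pool") || exit 2
    if [ "$health" != "ONLINE" ]; then
        echo "CRITICAL: $pool is $health" >&2
        exit 2
    fi
done
exit 0

Note that, as a later comment in this thread shows, the health property tracks vdev state only and can stay ONLINE even when data errors have been logged.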

rlaager closed this as completed May 23, 2016
@seb314

seb314 commented Aug 8, 2021

does "zpool get health" actually work for this?
after intentionally corrupting a zpool and subsequent scrub, I get

# zpool status
  pool: testpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h0m with 6517 errors on Sun Aug  8 20:43:45 2021
config:

        NAME                        STATE     READ WRITE CKSUM
        testpool                    ONLINE       0     0     0
          /home/server/testzfsblob  ONLINE       0     0 12,8K

errors: 6478 data errors, use '-v' for a list

but

# zpool get health ; echo $?
NAME            PROPERTY  VALUE   SOURCE
testpool        health    ONLINE  -
0

and

# zpool list -H -o health testpool ; echo $?
ONLINE
0

@richardelling
Contributor

richardelling commented Aug 8, 2021 via email

@goertzenator

This little bash fragment seems to do the trick, since zpool status -x prints only a single "all pools are healthy" line when nothing is wrong:

(( $(zpool status -x | wc -l) < 2 ))

An aside: I note that btrfs has the --check switch that causes device errors to be encoded into the exit code. For example, btrfs device stats --check / would return non-zero if there were device errors. It would be nice if zfs had this, but I heartily agree that it should not be the default behavior.
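
For completeness, that fragment might be wrapped into a standalone check along these lines (still a sketch, and still dependent on zpool status -x printing that single "all pools are healthy" line when nothing is wrong):

#!/bin/bash
# Sketch: non-zero exit if `zpool status -x` reports anything beyond one line.
if (( $(zpool status -x | wc -l) < 2 )); then
    exit 0                  # only the "all pools are healthy" line was printed
else
    zpool status -x >&2     # show the problem report
    exit 2
fi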
