zpool status always returns true #4670

Closed
jimsalterjrs opened this issue May 19, 2016 · 10 comments

@jimsalterjrs

jimsalterjrs commented May 19, 2016

zpool status returns true (exit status 0), even on a completely faulted pool. This means that automated monitors and alerts are forced to rely on parsing text output (which can change without warning) to evaluate pool health.

root@locutus:/data/test# zpool status test && echo $?
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Thu May 19 16:30:13 2016
config:

    NAME        STATE     READ WRITE CKSUM
    test        DEGRADED     0     0     0
      mirror-0  DEGRADED     0     0     0
        nbd0    UNAVAIL      0     0     0  corrupted data
        nbd1    ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        nbd2    ONLINE       0     0     0
        nbd3    ONLINE       0     0     0

errors: No known data errors
0

In the above example, a DEGRADED pool's status still returns 0.

root@banshee:~# zpool status -x test ; echo $?
  pool: test
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: scrub in progress since Thu May 19 16:45:22 2016
    187M scanned out of 187M at 2.13M/s, 0h0m to go
    0 repaired, 99.87% done
config:

    NAME        STATE     READ WRITE CKSUM
    test        UNAVAIL      0     0     0  insufficient replicas
      mirror-0  UNAVAIL      0     0     0  insufficient replicas
        nbd0    UNAVAIL      0     0     0  corrupted data
        nbd1    UNAVAIL      0     0     0  corrupted data
      mirror-1  ONLINE       0     0     0
        nbd2    ONLINE       0     0     0
        nbd3    ONLINE       0     0     0

errors: 1539 data errors, use '-v' for a list
0

And here we see even a completely UNAVAIL pool still returning 0 from a status check.

I would really, really like to see zpool status returning a parseable exit code. An additional option for text output in a stable format designed for machine parsing (as well as predictable exit codes) would be even better.
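
For illustration, the kind of text-scraping check this currently forces might look roughly like the following sketch (the pool name test and the Nagios-style exit codes are only placeholders):

#!/bin/sh
# Sketch only: grep the human-readable output for a bad pool state.
# Deliberately fragile - this wording is not a stable interface.
pool="test"
if zpool status "$pool" | grep -qE 'state: *(DEGRADED|FAULTED|UNAVAIL)'; then
    echo "CRITICAL: pool $pool is not healthy" >&2
    exit 2    # Nagios-style CRITICAL
fi
exit 0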

@loli10K
Contributor

loli10K commented May 20, 2016

Out of curiosity I tried to reproduce the same problem on illumos: its implementation of zpool status seems to behave the same way ZoL does:

[root@smartos]# zpool status -x dozer && echo $?
  pool: dozer
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 20 10:38:59 2016
config:

        NAME                      STATE     READ WRITE CKSUM
        dozer                     UNAVAIL      0     0     0  insufficient replicas
          mirror-0                UNAVAIL      0     0     0  insufficient replicas
            9499654404053574007   UNAVAIL      0     0     0  was /dev/dsk/c4t0d0s0
            c3t0d0                REMOVED      0     0     0
          mirror-1                DEGRADED     0     0     0
            c2t0d0                ONLINE       0     0     0
            15981958288646757095  FAULTED      0     0     0  was /dev/dsk/c5t0d0s0

errors: No known data errors
0

This means that automated monitors and alerts are forced to rely on parsing text (which can change without warning) to evaluate pool health.

The latest FreeNAS release does exactly that: https://github.com/freenas/freenas/blob/FN-9.10-RELEASE/gui/middleware/notifier.py#L5660

I don't know if we could just implement this and call it a day; maybe it should be discussed with the other OpenZFS members.

@behlendorf
Contributor

behlendorf commented May 20, 2016

This is actually by design. The exit code indicates that the command returned without error, not that the pool is healthy. You're going to need to parse the output to determine the status; this should get easier when the JSON support in #3938 is finalized. We could consider adding another command line option to change this behavior, or a new sub-command. But I'd rather not change the expected, long-standing default behavior.

@jimsalterjrs
Author

The problem I have with this is that the output format changes without warning. In fact this has already happened once - last year I woke up to 100 bogus critical pool health warnings from my nagios network because of a change in the text of zpool status after an automatic upgrade. :-\

(Sent from my phone - please blame any weird errors on autocorrect)

@behlendorf
Contributor

Then let's add a reliable interface. That could be the JSON output, which is structured and won't change, or something else.

@drescherjm

I would like this for the same reason. I also use nagios to monitor my servers.

@richardelling
Contributor

Try: zpool get health

@rlaager
Member

rlaager commented May 23, 2016

If you have the pool name, this is even simpler: zpool list -H -o health pool
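
As a usage sketch (pool names passed as arguments, exit codes chosen for a Nagios-style check), that could be wrapped like this:

#!/bin/sh
# Sketch: exit non-zero unless every named pool reports ONLINE.
for pool in "$@"; do
    health=$(zpool list -H -o health "$pool") || exit 2
    if [ "$health" != "ONLINE" ]; then
        echo "CRITICAL: $pool is $health" >&2
        exit 2
    fi
done
exit 0

Note that, as a later comment in this thread shows, the health property tracks vdev state only and can stay ONLINE even when data errors have been logged.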

rlaager closed this as completed May 23, 2016
@seb314

seb314 commented Aug 8, 2021

does "zpool get health" actually work for this?
after intentionally corrupting a zpool and subsequent scrub, I get

# zpool status
  pool: testpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h0m with 6517 errors on Sun Aug  8 20:43:45 2021
config:

        NAME                        STATE     READ WRITE CKSUM
        testpool                    ONLINE       0     0     0
          /home/server/testzfsblob  ONLINE       0     0 12,8K

errors: 6478 data errors, use '-v' for a list

but

# zpool get health ; echo $?
NAME            PROPERTY  VALUE   SOURCE
testpool        health    ONLINE  -
0

and

# zpool list -H -o health testpool ; echo $?
ONLINE
0

@richardelling
Contributor

richardelling commented Aug 8, 2021 via email

@goertzenator

This little bash fragment seems to do the trick, since zpool status -x prints only a single "all pools are healthy" line when nothing is wrong:

(( $(zpool status -x | wc -l) < 2 ))

An aside: I note that btrfs has the --check switch that causes device errors to be encoded into the exit code. For example, btrfs device stats --check / would return non-zero if there were device errors. It would be nice if zfs had this, but I heartily agree that it should not be the default behavior.
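
For completeness, that fragment might be wrapped into a standalone check along these lines (still a sketch, and still dependent on zpool status -x printing that single "all pools are healthy" line when nothing is wrong):

#!/bin/bash
# Sketch: non-zero exit if `zpool status -x` reports anything beyond one line.
if (( $(zpool status -x | wc -l) < 2 )); then
    exit 0                  # only the "all pools are healthy" line was printed
else
    zpool status -x >&2     # show the problem report
    exit 2
fi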
