
Intermittent unreliable zpool mounting on boot #2444

Closed
montanaviking opened this issue Jun 30, 2014 · 19 comments

@montanaviking

Hi,
I've been configuring a new system as specified below:
  • dual Xeon E5-2630 v2 CPUs
  • SuperMicro X9DRI-F-O motherboard with 64GB ECC RAM
  • OS installed on an Intel 530 series 120GB SSD, on a 60GB partition
  • the OS is Ubuntu 14.04 64-bit
  • mdadm was installed
  • ZFS on Linux was installed (the kernel-module version, of course)
  • the /home directory was placed on a ZFS RAIDZ2 pool using four 1TB drives, i.e. 2 Seagate Barracudas and 2 Western Digital Blacks
  • the zpool was mounted automatically via ZFS automount
After installation, I noticed that Ubuntu 14.04 would occasionally fail to mount the zpool on boot. When this happened, the pool could still be mounted manually after booting.
Strangely, when the OS (Ubuntu 14.04) root directory was moved from the SSD to a regular hard drive (1TB Seagate Barracuda), zpool mounting appeared to be reliable.
I've gone back to an Ubuntu 12.04 install on my SSD with ZFS on Linux (as configured above) and now see no apparent issues.
Of course, I would really like to get Ubuntu 14.04 working.
Thanks,
Phil

@behlendorf behlendorf added this to the 0.6.4 milestone Jun 30, 2014
@behlendorf behlendorf added the Bug label Jun 30, 2014
@ryao
Contributor

ryao commented Jul 2, 2014

@dajhorn This sounds like a race in upstart's initialization of the system.

@dajhorn
Contributor

dajhorn commented Jul 2, 2014

This is #330. Section 4.3 of the FAQ applies here.

@behlendorf
Contributor

Closing as a duplicate of the above-referenced issue.

@montanaviking
Author

Hi Darik,
I was looking at the script you mentioned in #330 and I have a couple of questions, namely:
while [ true ]; do
ls /dev/disk/by-id/scsi-* >/dev/null 2>&1 && break
The above code, it seems, would break out of the loop as soon as just one of the disks came online and its entry appeared in the /dev/disk/by-id directory. What if one or more members of the RAIDZ array don't come up in time? It looks like the loop could let the rest of the mountall command proceed before all the disks were up and running (i.e., before their entries appeared in /dev/disk/by-id), in which case one could still encounter the intermittent RAIDZ mount failure even with this code in place.
Wouldn't one have to write the loop so that it keeps waiting until all the drives in the RAIDZ array appear in /dev/disk/by-id? For example, something like the sketch below.
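(Just a rough sketch of what I mean; DISK1 through DISK4 stand in for the actual /dev/disk/by-id names of the pool members, and a real version would still need the FAQ script's 60-try timeout:)

while [ true ]; do
    [ -e /dev/disk/by-id/DISK1 ] && [ -e /dev/disk/by-id/DISK2 ] && \
    [ -e /dev/disk/by-id/DISK3 ] && [ -e /dev/disk/by-id/DISK4 ] && break
    sleep 1
done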
Please forgive my ignorance, but it's likely I'm missing something here.
Thanks,
Phil

@dajhorn
Contributor

dajhorn commented Aug 30, 2014

@montanaviking, I don't remember that code or our conversation, but anything that spins in a while loop is probably inadequate. A general solution for ticket #330 must do at least two things:

  • Recognize zpool members when they are attached to the system. (e.g., when a slow HBA brings disks online, or when a USB disk is plugged in.)
  • Implement policy knobs for import. (e.g., how long do we wait for all pool members to appear? Do we import degraded pools?)

You have the right idea, but you're not going to get the desired result unless ZoL is hooked into udev/dbus/mountall/upstart/systemd in all the right places, which is a non-trivial amount of integration work.
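To make the second bullet concrete, the import policy might look roughly like this. This is purely an illustrative sketch, not shipped code; the pool name, the timeout, and the degraded-import choice are all placeholder assumptions:

#!/bin/sh
# Sketch of the two policy knobs: how long to wait, and whether to import a degraded pool.
POOL="tank"
TIMEOUT=60
IMPORT_DEGRADED="no"

elapsed=0
state=""
while [ "$elapsed" -lt "$TIMEOUT" ]; do
    # "zpool import" with no pool name only scans and reports pool health;
    # it does not import anything.
    state=$(zpool import 2>/dev/null | \
        awk -v p="$POOL" '$1 == "pool:" { cur = $2 } $1 == "state:" && cur == p { print $2 }')
    [ "$state" = "ONLINE" ] && break
    sleep 1
    elapsed=$((elapsed + 1))
done

if [ "$state" = "ONLINE" ] || [ "$IMPORT_DEGRADED" = "yes" ]; then
    zpool import -d /dev/disk/by-id "$POOL"
fi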

@montanaviking
Author

Hi Darik,
Thanks for your advice. For now, I would be satisfied with a patch that's reliable, and I'm thinking of implementing the loop from the FAQ link above, i.e.
https://github.com/zfsonlinux/pkg-zfs/wiki/Ubuntu-ZFS-mountall-FAQ-and-troubleshooting#43--jbod-mode-on-hardware-raid-implementations
with the exception of putting in an if statement that terminates the loop only when all my RAIDZ2 drives appear as entries in /dev/disk/by-id/xxxx, where xxxx is the drive ID of a member of the RAIDZ2 array.

Again, please pardon my ignorance, but when a drive's entry appears in /dev/disk/by-id, does this guarantee it will be ready for ZFS? I think disk entries appear in /dev/disk/by-id on boot only after the drives have been recognized by the OS, but I might be wrong here.

Does putting this loop in /etc/init/mountall.conf stall the ZFS automount until the loop terminates? (From the description in the FAQ, I think it does.)

So I'm thinking that if I set up a loop in /etc/init/mountall.conf as suggested in the FAQ above, and modify it so the loop continues until all my RAIDZ2 drives appear in /dev/disk/by-id, then I should be reasonably assured of avoiding the conditions that made mounting the RAIDZ2 /home directory unreliable. I'm also thinking that, should one of my drives actually fail at or before boot, the loop would terminate anyway after, say, 60 tries, and the RAIDZ2 could still mount with up to two failed drives.

Right now, I have the line
zfs mount -a
in my /etc/rc.local file to mitigate the automount problem, but it's not completely reliable.

Finally, if I may ask, is the line:
for file in /sys/block/sd* /sys/block/sd*/sd*; do udevadm test $file; done >/dev/null 2>&1
essential to the fix in the FAQ? I'm thinking that this just gives some additional information about the drives to the user.
I'm also not entirely sure what the following function in the FAQ script does (again, my ignorance):
failsafedbus() {
sleep 5
/sbin/start dbus
/sbin/start tty2
}

Thanks so much,
Phil

@dajhorn
Contributor

dajhorn commented Aug 30, 2014

when a drive's entry appears in /dev/disk/by-id does this guarantee it will be ready for ZFS?

Yes.

Does putting this loop in /etc/init/mountall.conf stall ZFS automount until the loop terminates (from the description in the FAQ I think it does)?

Yes.

So, I'm thinking that if I set up a loop in /etc/init/mountall.conf as suggested in the FAQ above and I modify it to make the loop continue until all my RAIDZ2 drives appear in /dev/disk/by-id then I should be reasonably assured that I'll avoid the conditions that made mounting the RAIDZ2 /home directory unreliable?

Yes, right.

Finally, if I may ask, is the line: for file in /sys/block/sd* /sys/block/sd*/sd*; do udevadm test $file; done >/dev/null 2>&1 essential to the fix in the FAQ?

Dunno, but it might have beneficial side effects like ensuring that the udev event queue is flushed.
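(For what it's worth, the more direct way to wait for the udev event queue to drain is udevadm settle, e.g.

udevadm settle --timeout=30

where the 30-second timeout is only an example value.)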

I'm not entirely sure of what the following function in the FAQ script does? failsafedbus()

This ensures that the system starts a console even if something goes wrong, like a typo in the script. These things run early enough that any mistake can break the system.

@montanaviking
Author

Hi Darik,
Thanks again for relieving my ignorance. At this rate, I'm actually going to begin to understand the internals of Linux.
I'm going to try the new script and report how it works. Basically, I'm going to modify the FAQ script by changing the loop (using an if statement) to continue until all the RAIDZ2 drives appear, but I'll keep the other parts essentially the same as shown in the FAQ. I expect that the loop's timeout of 60 tries will handle the possibility of one or more truly defective drives, allowing ZFS to then attempt to mount the degraded array.
I'm using cheap consumer-grade disks.
Thanks,
Phil


@montanaviking
Author

Hi Darik,
I've implemented the following script. Let its filename be /root/mountdelay, and be sure to set it as executable after creating it, i.e. run
# chmod +x /root/mountdelay
from a root command line.

############################################################
#!/bin/sh
# /root/mountdelay: wait until all ZFS pool members appear in /dev/disk/by-id
# before mountall runs. Replace xxx1..xxx4 with your drives' by-id names.

waitavail() {
    date +"%D %T Checking devices..."
    count=0
    diskdevdir="/dev/disk/by-id/"
    echo $diskdevdir
    while [ true ]; do
        # Break out of the loop only when every pool member's by-id entry
        # (both partitions of each drive) is present.
        if [ -e ${diskdevdir}xxx1-part1 ] && [ -e ${diskdevdir}xxx1-part9 ] && \
           [ -e ${diskdevdir}xxx2-part1 ] && [ -e ${diskdevdir}xxx2-part9 ] && \
           [ -e ${diskdevdir}xxx3-part1 ] && [ -e ${diskdevdir}xxx3-part9 ] && \
           [ -e ${diskdevdir}xxx4-part1 ] && [ -e ${diskdevdir}xxx4-part9 ]
        then
            break
        fi
        # Re-trigger udev processing for the block devices, then wait a second.
        for file in /sys/block/sd* /sys/block/sd*/sd*; do udevadm test $file; done >/dev/null 2>&1
        printf "\r%s\r" "$(date +'%D %T Waiting on devices...')"
        sleep 1
        count=$((count+1))
        # Give up after 60 tries so a genuinely failed drive cannot hang the boot.
        [ "$count" -gt 60 ] && break
    done
    if [ "$count" -gt 60 ]; then
        date +"%D %T Gave up checking devices."
    else
        date +"%D %T Done checking devices."
    fi
}

failsafedbus() {
    # Make sure dbus and a console come up even if something above goes wrong.
    sleep 5
    /sbin/start dbus
    /sbin/start tty2
}

waitavail >/dev/tty1 2>&1
failsafedbus &
exit 0
#######################################################################

Please notice the if [ ] && [ ] && [ ] ... statement above, as it is the main tool of the script.
This if statement is true when, and ONLY when, all the drives specified by their ID filenames in /dev/disk/by-id/ have shown up during the boot process. In my implementation, on my machine, I have two ID files for each of the four drives in my RAIDZ2 array - a total of eight conditions in the if statement, one for each of the two partitions on each disk. Perhaps this is redundant and I only need the disk ID itself, but I'm paranoid about these things.

Basically, the if statement keeps the loop running as long as at least one of the drive IDs is missing. When all drives are finally up, running, and ready to mount, their IDs will appear in the /dev/disk/by-id directory, the if statement will evaluate as true, the break statement will execute, and the loop will terminate. Only then will ZFS attempt its automount of the array - not until all the drives are ready, as indicated by the appearance of their ID files in /dev/disk/by-id.
Again, I'm checking for the ID files of both partitions of each drive, but perhaps the ID file of the drive itself would be sufficient - you would have to verify that.
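Incidentally, the eight tests could also be written as a small helper function, so that adding or removing drives only means editing one list. This is an untested variant of the same idea, with xxx1..xxx4 still standing in for the real drive IDs:

alldiskspresent() {
    for id in xxx1 xxx2 xxx3 xxx4; do
        for part in part1 part9; do
            [ -e "${diskdevdir}${id}-${part}" ] || return 1
        done
    done
    return 0
}

The while loop in waitavail would then simply use "alldiskspresent && break" in place of the long if statement.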

The script in /root/mountdelay is called from /etc/init/mountall.conf. Add the line /root/mountdelay just before the exec mountall line, so that the relevant section of /etc/init/mountall.conf reads:

...
/root/mountdelay
exec mountall --daemon $force_fsck $fsck_fix
...

Also, please remember to set the /root/mountdelay script file as executable!

Ultimately, please notice that in the above you MUST replace the placeholder values in the if statement:
xxx1 -> your 1st ZFS drive's ID
xxx2 -> your 2nd ZFS drive's ID
xxx3 -> your 3rd ZFS drive's ID
xxx4 -> your 4th ZFS drive's ID
and so on for all your ZFS drives. You must include ALL the drive ID filenames associated with ALL the drives in the ZFS arrays you want mounted on boot. While I haven't tried ZIL devices, I suggest you include them in the if statement too.
The drive IDs are obtained by looking in your /dev/disk/by-id directory once your system is booted. You may wish to use disk utilities or other tools to positively identify the disk IDs of the disks in your ZFS array(s). This is important if you want the script to reliably protect you from mis-mounting your ZFS arrays on boot.
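For example (a rough sketch; tank is my pool name, yours will differ):

zpool status tank         # lists the devices that make up the pool
ls -l /dev/disk/by-id/    # each ID is a symlink to its underlying /dev/sdX device

Matching the device names between the two outputs gives the by-id names to put into the if statement.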

Prior to this patch, I had been relying on an /etc/rc.local file containing:
zfs mount -a
Since my MySQL data was on my ZFS array, MySQL would not start properly because the ZFS array was not available early enough, so I also had to include:
service mysql start
in my /etc/rc.local as well.
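For reference, that /etc/rc.local looked roughly like this (the shebang and exit 0 are just the stock Ubuntu rc.local boilerplate):

#!/bin/sh -e
# old workaround -- no longer recommended, see below
zfs mount -a
service mysql start
exit 0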
BUT the above modifications to /etc/rc.local WERE NOT reliable, and I'd still occasionally get my ZFS array coming up degraded due to, I believe, race conditions and "disk not ready" problems at boot.
Moreover, having to restart a failed MySQL autostart was also a kludge that I wanted to avoid.
I currently recommend NOT modifying rc.local to fix your ZFS boot problems, but rather using the script above. While modifying /etc/rc.local did improve the reliability of ZFS automounting on boot, it was still not totally reliable. So far the script discussed above appears to give me reliable booting, and I'll let you all know if I run into problems.
Did I miss something? Your comments are most welcome,
Phil


@dajhorn
Contributor

dajhorn commented Aug 31, 2014

You must include ALL the drive ID filenames associated with ALL the drives in your ZFS arrays that you want to mount on boot. While I haven't tried ZIL drives, I also suggest you include them in the if statement too.

Right, and it should now be apparent how this kind of solution does not solve the general problem.

I had been relying on an /etc/rc.local file containing:
zfs mount -a
Since my mysql data was on my ZFS array, MYSQL would not properly start because the ZFS array was not available at an early enough time

The Ubuntu init stack relies on mountall issuing MOUNTED events for each mount point, which is why any form of zfs mount -a is not recommended.
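(For example, an Upstart job that needs a filesystem typically carries a stanza along these lines -- the mount point shown is only an example:

start on mounted MOUNTPOINT=/home

and that event is never emitted for a filesystem mounted behind mountall's back with zfs mount -a.)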

Did I miss something?

It seems good at first glance, but I don't have a way to test it. This is it-works-for-me territory until #330 is implemented.

HTH.

@montanaviking
Author

Hi Darik,
Would you like me to put my answer (the above) in the wiki or in the 4.2 paragraph of the FAQ (https://github.com/zfsonlinux/pkg-zfs/wiki/Ubuntu-ZFS-mountall-FAQ-and-troubleshooting#43--jbod-mode-on-hardware-raid-implementations)? I'm thinking others should be told about this patch, as I'm probably not the only one who has struggled with this.
Thanks again for all your help.
Phil

@dajhorn
Contributor

dajhorn commented Aug 31, 2014

@montanaviking, yes, feel free to edit the wiki. Anybody with a GitHub account has write access.

@montanaviking
Author

Hi,
I just updated my Ubuntu 14.04 64-bit box with the new zfs-dkms module, and my machine failed to reboot after the update (as of today, Nov 20, 2014).
I have a two-drive mirror pool (sim) and a RAIDZ2 home-directory pool named tank. The tank pool mounts to /home. Both the sim and tank pools were set to automount.
NOOB error:
I had put the mountpoint of my pool named sim inside the /home directory. In other words, sim was mounted within the mountpoint of tank. I had apparently gotten away with this arrangement until the latest ZFS update. In fact, ZFS apparently became so confused that I could not effectively export the pools from recovery mode. Even though the pools reported as exported, ZFS would attempt to mount them on subsequent boots, which stopped the boot in everything but the recovery root-shell mode. To fix this, I needed to uninstall zfs-dkms in the root-shell recovery mode, reboot, reinstall zfs-dkms, import the tank and sim pools, and set sim to a new mountpoint outside /home. Perhaps non-empty mountpoints for ZFS pools are likely to be a problem in any case?
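For the record, once zfs-dkms was reinstalled, the fix amounted to something like this (from memory; /sim is just an example path outside /home):

zpool import tank
zpool import sim
zfs set mountpoint=/sim sim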
Is there a log where I can see what ZFS thought of this whole thing?

So, I'm posting this for three main reasons.

  1. To warn others not to mount one ZFS pool inside another ZFS pool's mountpoint.
  2. I wonder if it would be possible to have ZFS detect such arrangements and order the mounts automatically, mounting the pool whose mountpoint is higher in the directory hierarchy first and the pools whose mountpoints nest inside it afterwards - an automatically controlled hierarchy of ZFS pool mounts that would prevent one ZFS mountpoint from overrunning another. If the user made the mistake of assigning the same mountpoint to more than one pool, then there should be a rule as to which pool is mounted, and the user should be warned.
  3. Is this fundamentally a problem of ZFS attempting to mount on a non-empty mountpoint and becoming confused as a result?

Thanks,
Phil


@montanaviking
Author

Hi Darik,
Also, did the new ZFS on Linux update address the apparent race condition I was seeing, titled "Intermittent unreliable zpool mounting on boot #2444" and described at
https://github.com/zfsonlinux/pkg-zfs/wiki/Ubuntu-ZFS-mountall-FAQ-and-troubleshooting#43--jbod-mode-on-hardware-raid-implementations ?

My patch reported there does seem to be working solidly on my machine. To summarize, I had seen what appeared to be problems with ZFS attempting to auto-import and/or automount a RAIDZ2 pool of four mechanical 1TB drives on boot. Sometimes one or more of the drives in the RAIDZ2 pool would get rejected from the array as bad on boot. Forcing the system to call a script from /etc/init/mountall.conf that checks that all the drives in the pool are present before going on appears to solve the intermittent mount failures.

Will this be necessary or prudent after the recent ZFS update?
Thanks,
Phil


@dajhorn
Contributor

dajhorn commented Nov 21, 2014

Perhaps non-empty mountpoints for ZFS pools are likely to be a problem in any case?

Yes. Commit f67d709 in ZoL 0.6.4 will provide an easy way to avoid this Solaris behavior, which has several side-effects on Linux.

Is there a log where I can see what ZFS thought of this whole thing?

Add the --debug switch to the exec mountall line in the /etc/init/mountall.conf file. Reboot, and watch /var/log for messages, especially the /var/log/upstart/mount* files.
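(That is, the exec line ends up looking something like

exec mountall --daemon --debug $force_fsck $fsck_fix

with the rest of the file left as shipped.)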

I wonder if it would be possible to have ZFS detect such mounting arrangements and automatically mount the pools mounted to the lower level mountpoints first and subsequently mount pools requesting lower-level mountpoints later

You're describing a system configuration that I test and that usually works. However, any configuration that mountall doesn't already handle for regular built-in filesystems also won't be handled for ZFS.

Also, did the new ZFS on Linux update address the apparent race condition I was seeing, titled "Intermittent unreliable zpool mounting on boot #2444"?

Yes, the /etc/init/zpool-import.conf file has a generalized fix for issues like #2444 (and especially #330). Notice the new ZFS_AUTOIMPORT_TIMEOUT option in the /etc/default/zfs file.
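For example, the knob is a single line in /etc/default/zfs (the value shown is only illustrative; the comments in the file document the unit and the default):

ZFS_AUTOIMPORT_TIMEOUT=60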

@montanaviking
Author

Hi Darik,
Thanks for all your help and insights. If I may ask another, unrelated question:
I have an old (vintage 2007) quad-core Intel machine with 8GB of RAM that I want to re-use as a small server at the university where I work. The RAM is non-ECC.
Given the warnings I've heard about non-ECC RAM with ZFS, would it be better to run ZFS or EXT-4 for the filesystem? I'm thinking that RAM corruption would be equally bad on nearly any filesystem. On the other hand, since ZFS uses as much of the RAM as it can for cache, it's perhaps more exposed to RAM problems than other filesystems. What about BTRFS?
What would be your opinion here? Fortunately, my new personal server has ECC RAM, and I'll never get another machine (other than a laptop) without it. It makes me wonder why one can't get laptops with ECC.
Thanks,
Phil

@dajhorn
Contributor

dajhorn commented Nov 21, 2014

Given the warnings I've heard about non-ECC with ZFS, would it be better to run ZFS or EXT-4 for the filesystem?

If the upstream engineering team thinks that lack of ECC is an actual problem for ZFS, then they could easily check for it at runtime and squawk a warning into the system log, but they don't.

Now, more people know how to scrape data out of a bad EXT4 instance than out of ZFS, but I haven't seen any evidence that using EXT4 on duff equipment mitigates data loss in any way versus the end-to-end protection that ZFS provides.

Personally, I would rather wear a bike helmet than live near a good brain surgeon, even if the bike helmet refuses to mount /skull or segfaults in an unfashionable way when it detects a fault.

What about BTRFS? What would be your opinion here?

Dunno. I don't use it often enough to know how it behaves in this circumstance.

@montanaviking
Author

Hi Darik,
That's kind of as I suspected. EXT-4 won't really mitigate non-ECC issues better than ZFS, but does add risk. I figure that I don't want to be scraping data from a bad EXT-4 mount anyway, and would just keep a good set of backups. In any case, I don't recommend keeping critical data on non-ECC machines and to do so is a fool's errand.
Thanks,
Phil

