Occasional install to disk fails - BLKRRPART, usually requires reboot, re-do install steps to proceed #152
Comments
Hm, it is possible that we are racing with udev. I don't know for sure, but if udev triggered BLKRRPART before us and is now probing the filesystems on the partitions, the open partition device node(s) may cause the disk to be considered in use. We could call udevadm settle, but because of the way udev detects changes to disks it is hard to avoid racing with it. I will need to do more research to figure out the best solution. |
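A minimal sketch of the "settle, then retry" idea mentioned above, not what coreos-install actually does. The function names `retry_cmd` and `reread_partitions`, the attempt count, and the delay are assumptions chosen for illustration; `udevadm settle` and `blockdev --rereadpt` are the standard tools.

```shell
# retry_cmd ATTEMPTS DELAY CMD [ARGS...]
# Runs CMD until it succeeds or ATTEMPTS runs have failed.
# (Hypothetical helper, not part of coreos-install.)
retry_cmd() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        # Pause before the next try, but not after the final one.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# reread_partitions DISK
# Waits for udev's event queue to drain, then asks the kernel to
# re-read the partition table, retrying if the device is busy.
# The values 5 and 1 are arbitrary assumptions.
reread_partitions() {
    local disk="$1"
    udevadm settle
    retry_cmd 5 1 blockdev --rereadpt "$disk"
}
```

Usage would be something like `reread_partitions /dev/sdb` as root. As the comment above says, settling only narrows the window for the race; it does not close it.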
FYI: we've seen the same issue and employed the same workaround as @jl-montes |
Same problem experienced tonight (v444.4.0) on a bare-metal install from CD; booted again & retried install, happened again. Funny thing is after that I booted from disk to see what would happen, CoreOS was actually installed, just didn't pick up my cloud-init (which was on a path mounted from USB drive.) |
I ran into this same issue repeatedly. I posted to the Google group, but I hit this issue 19 out of 20 times, even with rebooting. However, I found a second workaround that does not require a reboot. After receiving the BLKRRPART error, I unmounted the drive I was installing to, fixed the GPT to span the entire drive, and ran the install script again. This worked 100% of the time. I tested it on three machines on which I had received the BLKRRPART or segfault errors while trying to install to disk. |
Same for me as for @wdennis: the script failed with BLKRRPART, but CoreOS was actually installed and it booted. To be sure everything was all right, I shut down the installed instance of CoreOS, booted ISOLINUX again, and ran the install script once more. This time: Some more info that might be helpful: |
this error also happens when installing via ISO (CD-ROM), however, if you actually eject the CD, and reboot... you will find that it has indeed installed coreOS.... |
It seems unlikely that udev is the culprit: I just tried killing it until systemd gave up restarting it, and still
Using CoreOS stable (CoreOS 607.0.0), also PXE-booted. I am at a loss to determine what is actually keeping the partitions busy, but since I can reproduce this at will, please feel free to suggest commands to try. |
Interesting, so there must be two races going on. There is a race with The explicit reprobe in this script is there because only relatively recent
|
Encountered this issue on 633.1.0. Any update? |
I had the same error installing from ISO. From USB I got Input/Output Error. Pulling up gparted from the Ubuntu live cd, it fixed errors in the GPT layout, which was all small partitions (I was installing to a 1TB HD). Reran twice, same error. |
Same error here. PXE booted and Installing from stable channel. Just followed same workaround as @jl-montes but the success rate is way lower. |
Same issue installing 745.1.0. I had the issue when installing from an Ubuntu USB key, and then it worked. Now I am trying to reinstall from inside the running CoreOS. |
We ran into this several times, and I think I found the culprit in at least one scenario. We kept encountering this with client provisions and couldn't reproduce in lab until we used the client's exact userdata, and then it was intermittent. When cloud-init contains anything that involves a download it seems to lock the device. The weird part is it does this even if the disk doesn't have a filesystem on it at all. I suspect docker or something else in third-party expects a disk to be present and if it isn't it picks what it thinks should be the first one /dev/sda and tries to use it. For our installer image, I removed any access to userdata until the second boot and haven't encountered it again yet. Will update if I see it again. |
So in the case of our farm of bare-metal boxes, I think that the issue was pre-existing LVM volumes . Zapping them with vgremove prior to running coreos-install solves the issue for me. Teaching coreos-install to do same could be worthwhile, although slightly trickier for multi-disk systems. |
I can confirm using vgremove as @domq mentioned just worked for me on baremetal booted via IPMI mounted ISO. |
I had this issue, too. The disk used to be part of a RAID. I went ahead and reformatted the disk with fdisk and it installed successfully. Not sure if this is actually the cause of the success though. |
Just ran into this, device used to be part of a LVM volume group. Will remove metadata manually and see if it works. |
I had this issue, too. at about line 303 It works. |
It seems like this issue is getting a bit muddied. Unfortunately the BLKRRPART error can be caused by a variety of reasons and sometimes is legitimate reporting (device is actually busy). I think for this issue to remain valid we need to create specific reproduction steps and treat each verified set of reproduction steps as a new issue. The issue we had has been resolved (see this comment). Does anyone have specific reproduction steps? Otherwise I suggest archiving this for reference and treating new instances as new issues with a focus on tracking it down per hardware setup. Does that sound reasonable? |
One case is definitely active LVM Volume groups on DEVICE from a previous install of another OS (my use case is having to install CoreOS via grub from a previous CentOS install). Pre Install state:
Post Install state:
These should be set inactive before the image is written to DEVICE:
I'm using this unit for now in my oem cloud config (a bit more of a hammer):
|
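The commands from the comment above were lost, so the following is only a sketch of how the "set the volume groups inactive first" step could look, not the commenter's original. The helper names `vgs_on_disk` and `deactivate_vgs_on_disk` are hypothetical; `pvs --noheadings -o pv_name,vg_name` and `vgchange -an` are standard LVM commands.

```shell
# vgs_on_disk DISK
# Reads `pvs --noheadings -o pv_name,vg_name` output on stdin and prints
# the volume groups that have a physical volume on DISK or on one of
# its partitions (e.g. /dev/sda2 or /dev/nvme0n1p3).
vgs_on_disk() {
    local disk="$1"
    awk -v disk="$disk" '
        # $1 = PV path, $2 = VG name (empty when the PV is in no VG)
        $1 ~ ("^" disk "p?[0-9]*$") && $2 != "" { print $2 }
    ' | sort -u
}

# deactivate_vgs_on_disk DISK
# Deactivates every volume group found on DISK. Needs root.
# This does not delete anything; it only closes the logical volumes.
deactivate_vgs_on_disk() {
    local disk="$1" vg
    pvs --noheadings -o pv_name,vg_name | vgs_on_disk "$disk" |
    while read -r vg; do
        vgchange -an "$vg"
    done
}
```

Running `deactivate_vgs_on_disk /dev/sda` before coreos-install would be the gentler alternative to the vgremove approach mentioned earlier in the thread, since the old metadata gets overwritten by the image anyway.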
I can reproduce this quite easily PXE booting on a test cluster of Dell Optiplex 960 towers (desktop class machines) and running coreos-install. I wonder how this lower spec of hardware comes into play ... (slow disks, unsafe caching, etc). This is an example of a failed install:
Both
|
@coconutpilot neat! Would you mind filing a PR against https://github.com/coreos/init/blob/master/bin/coreos-install? |
👍 |
Before I submitted my pull request I wanted to do some more testing. A simpler test case of the bug is (may need to run a few times):
Looking at the kernel ioctl: http://lxr.free-electrons.com/source/block/ioctl.c#L184 it's wrapped in a mutex, so I am at a loss as to why this is happening. As noted by @marineam in the first comment to the issue report it seems to be a race with udev, this is where udev does BLKRRPART: This means my proposed fix A proposal for a solution:
Does this sound sane? |
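Since the failure is intermittent, a small loop that counts failures can help when comparing before and after a change. This is a sketch only and is not the test case from the comment above, which was lost; `count_failures` is a hypothetical helper name.

```shell
# count_failures RUNS CMD [ARGS...]
# Runs CMD RUNS times, discarding its output, and prints how many
# runs exited non-zero. (Hypothetical helper for illustration.)
count_failures() {
    local runs="$1" failures=0 i=0
    shift
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, `count_failures 20 blockdev --rereadpt /dev/sdb` (as root) prints the number of failed re-reads out of 20. If a partition on the disk is mounted or otherwise held open, every run is expected to fail with "Device or resource busy", so this is only informative on a disk that should be idle.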
I got the same issue. Only worked after deleting the VG. |
I hit the BLKRRPART error when I tried to install CoreOS stable (835.12.0) to disk on a XEN 3.16 PV guest. |
I just hit it when I am installing CoreOS on machines that had CentOS installed and for me removing DM mapping prior to running installer fixes the issue:
centos_blah_blah names can be listed from /dev/mapper/*. After that I can write CoreOS to /dev/sda |
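To see which device-mapper devices are sitting on top of a disk before removing them, sysfs can be read directly. The sketch below assumes the usual sysfs layout (`/sys/block/<disk>/<partition>/holders/`); the function name `list_holders` and its optional second argument are hypothetical and exist only so the function can be tried against a fake directory tree.

```shell
# list_holders DISK_NAME [SYSFS_ROOT]
# Prints one line per device that holds DISK_NAME or one of its
# partitions open, as "<kernel name> <dm name>" when a device-mapper
# name is available. DISK_NAME is the bare name, e.g. "sda".
list_holders() {
    local disk="$1" sys="${2:-/sys}" holder name
    for holder in "$sys/block/$disk"/holders/* "$sys/block/$disk"/*/holders/*; do
        [ -e "$holder" ] || continue    # glob matched nothing
        name=$(basename "$holder")
        if [ -r "$holder/dm/name" ]; then
            echo "$name $(cat "$holder/dm/name")"
        else
            echo "$name"
        fi
    done
}
```

On a machine like the one described above, `list_holders sda` would be expected to print lines such as `dm-0 <mapping name>`; each mapping name could then be passed to `dmsetup remove`.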
The sfdisk tool used for partitioning SD-cards has, especially in an older 2.23.2 version on CentOS 7, problems re-reading the partitions it just created. As even the --force parameter does not prevent sfdisk from grinding to a halt, use --no-reread instead. This avoids races with udev, as stated in numerous bug reports: https://bugs.launchpad.net/ubuntu/+source/util-linux/+bug/942788 coreos/bugs#152 https://bugs.centos.org/view.php?id=986 Change-Id: I09c4a90c99e324abb8469d6bad1465713d7c8b32 Signed-off-by: Bert van Hall <bert.vanhall@avionic-design.de> Reviewed-on: http://review.adnet.avionic-design.de/5446 Reviewed-by: Dirk Leber <dirk.leber@avionic-design.de>
Just bumped into this issue on a machine that was previously a RAID-1 setup. Got it to work after trashing both disks' MBR and rebooting: BEWARE: this will trash all data on those disks!
|
@haolez which version of CoreOS was being used to run coreos-install? |
@crawford 1122.2.0 EDIT Ah, sorry! I was using 1122.2.0 to run coreos-install. The one mentioned previously was what I was trying to install locally. (Not a production environment) |
This still happens on 1153.0.0 as well. Fixed it with @haolez dd trick. |
@haolez interesting. |
Closing due to inactivity. |
Hi, I am getting the below error while doing a PXE boot installation of CoreOS v1122.3.0 in VirtualBox: What we notice is that the coreos-install will fail and the last error indicates BLKRRPART: Device or resource busy I tried a number of the suggestions mentioned above, but none of them solved the problem. |
Experienced this on current stable (1235.12.0). The storage was previously set up with LVM. As @wdennis stated, the install had actually completed and on reboot came up to CoreOS. The contents of my |
I was installing CoreOS 1576.4.0 on a machine that had Ubuntu installed with LVM active. I experienced the same thing, so I tried to reproduce it on a different machine that also had LVM installed, with a machine that had no LVM from a previous install as the negative control. This was completely reproducible in 4 attempts. For the fourth attempt, I dd-ed an LVM Linux system onto a machine before doing the CoreOS install. It installs, but without the cloud-config details, so I reboot and reinstall, and that clears it up. I was installing after booting into a live USB of Ubuntu, and I tried it with a live USB of CentOS as well. I wonder what is holding on to the disk. I will try later this weekend (we are getting a snow storm, so not much to do out there) adding the LVM deactivation to my install script... |
It sounds as though the primary culprits are open LVM PVs and RAID volumes. coreos-install shouldn't automatically close them, but it could check for this case by running An ambitious implementation could also check whether the disk has open LVM PVs ( |
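One possible shape for such a check, sketched with `lsblk`. The commands the comment above had in mind were lost, so this is an assumption rather than the proposed implementation; `busy_reasons` and `preflight` are hypothetical names. It reports rather than closes anything, in line with the suggestion that coreos-install shouldn't close volumes automatically.

```shell
# busy_reasons
# Reads `lsblk -nr -o NAME,TYPE DISK` output on stdin and prints one
# line for each LVM or RAID device stacked on top of the disk.
busy_reasons() {
    awk '$2 == "lvm" || $2 ~ /^raid/ { print "stacked " $2 ": " $1 }'
}

# preflight DISK
# Returns non-zero and explains why if DISK has active LVM logical
# volumes or is a member of an assembled RAID array.
preflight() {
    local disk="$1" reasons
    reasons=$(lsblk -nr -o NAME,TYPE "$disk" | busy_reasons)
    if [ -n "$reasons" ]; then
        echo "$disk appears to be in use:" >&2
        echo "$reasons" >&2
        return 1
    fi
}
```

An installer could call `preflight /dev/sdb || exit 1` before writing the image, so the user gets a specific message instead of a late BLKRRPART failure.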
I did a disk install of CoreOS and I see the exact same issue.
Here are the steps that I took:
|
I just ran into this and in my case, it was because the disk currently had Container Linux instead. When I booted the ISO image, it mounted |
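A quick way to check for this case is to look for partitions of the target disk in `/proc/mounts`. The sketch below is an illustration only; `mounted_parts` is a hypothetical helper, and its optional second argument exists so it can be pointed at a saved copy of the mount table.

```shell
# mounted_parts DISK [MOUNTS_FILE]
# Prints "<device> <mount point>" for DISK and each of its partitions
# that appear in the mount table (default: /proc/mounts).
mounted_parts() {
    local disk="$1" mounts="${2:-/proc/mounts}"
    awk -v disk="$disk" '
        # Match the disk itself or a partition such as sdb9 / nvme0n1p9.
        $1 ~ ("^" disk "p?[0-9]*$") { print $1, $2 }
    ' "$mounts"
}
```

If `mounted_parts /dev/sdb` prints anything, unmounting those mount points with `umount` before running coreos-install should let the partition table be re-read, as an earlier comment in this thread also found.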
This is still happening and I can't figure out the fix from this issue. Can someone point me to a workaround?
$ sudo coreos-install -d /dev/sdb -i ignition.json
Current version of CoreOS Container Linux stable is 2191.5.0
Downloading the signature for https://stable.release.core-os.net/amd64-usr/2191.5.0/coreos_production_image.bin.bz2...
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
2019-09-17 17:40:36 URL:https://stable.release.core-os.net/amd64-usr/2191.5.0/coreos_production_image.bin.bz2.sig [566/566] -> "/tmp/coreos-install.ILrflXrixd/coreos_production_image.bin.bz2.sig" [1]
Downloading, writing and verifying coreos_production_image.bin.bz2...
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
^[
2019-09-17 17:48:10 URL:https://stable.release.core-os.net/amd64-usr/2191.5.0/coreos_production_image.bin.bz2 [481116178/481116178] -> "-" [1]
gpg: Signature made mié 04 sep 2019 01:27:14 -03
gpg: using RSA key FD986FB096482F906F55B2EA01C9CAE767B3CA0E
gpg: key 50E0885593D2DCB4 marked as ultimately trusted
gpg: checking the trustdb
gpg: marginals needed: 3 completes needed: 1 trust model: pgp
gpg: depth: 0 valid: 1 signed: 0 trust: 0-, 0q, 0n, 0m, 0f, 1u
gpg: Good signature from "CoreOS Buildbot (Offical Builds) <buildbot@coreos.com>" [ultimate]
blockdev: ioctl error on BLKRRPART: Device or resource busy
Failed to reread partitions on /dev/sdb
blockdev: ioctl error on BLKRRPART: Device or resource busy
Failed to reread partitions on /dev/sdb
blockdev: ioctl error on BLKRRPART: Device or resource busy
Failed to reread partitions on /dev/sdb
blockdev: ioctl error on BLKRRPART: Device or resource busy
Failed to reread partitions on /dev/sdb |
I am using Fedora CoreOS; it works much better - https://github.com/coreos/coreos-installer/ - Download images from https://getfedora.org/coreos/download/ - If you have issues, the mailing lists are very responsive. |
I managed to 'fix' the install on my thumbdrive using @ivarec solution dd if=/dev/zero of=/dev/sda bs=512 count=1 If we're gonna wipe the device anyways, maybe the |
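The dd command quoted above can be wrapped so that it is easy to try on a scratch file before pointing it at a real disk. This is a sketch; `zero_first_sector` is a hypothetical name, and the only change from the quoted command is `conv=notrunc`, which keeps a regular file from being truncated and makes no difference on a block device.

```shell
# zero_first_sector TARGET
# Overwrites the first 512 bytes of TARGET (a block device or a file)
# with zeros. DESTROYS the MBR / protective MBR on a real disk.
zero_first_sector() {
    local target="$1"
    dd if=/dev/zero of="$target" bs=512 count=1 conv=notrunc 2>/dev/null
}
```

Note that this clears only sector 0. GPT keeps a backup header at the end of the disk, and LVM and RAID metadata live elsewhere, so this may not be enough on every previously used disk.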
We had a similar problem with Flatcar. After some investigation I found that |
For the past several months we've seen occasional, random failures when attempting disk installations after PXE booting to an in-memory version of CoreOS.
What we notice is that the coreos-install will fail and the last error indicates BLKRRPART: Device or resource busy
The workaround we've typically employed is to reboot, PXE boot again to CoreOS in-memory, then attempt the disk install again. 98-99% of the time we never see the error again and we get a successful install to disk.
Attached is a sample screen of when the random failure happens.
We've seen this on bare-metal blade servers and pizza-box servers, KVM VMs, and Hyper-V VMs in the past.