This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

Occasional install to disk fails - BLKRRPART, usually requires reboot, re-do install steps to proceed #152

Open
jl-montes opened this issue Sep 26, 2014 · 43 comments

Comments

@jl-montes

We've seen for the past several months occasional and random behavior when attempting disk installations after PXE booting to an in-memory version of CoreOS.

What we notice is that the coreos-install will fail and the last error indicates BLKRRPART: Device or resource busy

The workaround we've typically employed is to reboot, PXE boot into the in-memory CoreOS again, then attempt the disk install again. 98-99% of the time we never see the error again and the install to disk succeeds.

Attached is a sample screen of when the random failure happens.

We've seen this on bare-metal blade servers and pizza-box servers, KVM VMs, and Hyper-V VMs in the past.

[Screenshot: coreos-blkrrpart-devicebusy]

@marineam

Hm, it is possible that we are racing with udev. I don't know for sure, but if udev triggered BLKRRPART before us and is now probing the filesystems on the partitions, the open partition device node(s) may cause the disk to be considered in use. We could call udevadm settle, but because of the way udev detects changes to disks it is hard to avoid racing with it. I will need to do more research to figure out the best solution.
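
A minimal sketch of the "settle, then retry" idea mentioned above, assuming the race is with udev at all (which is not confirmed here). The function name `retry_after_settle` and the `SETTLE_CMD` override are hypothetical and not part of coreos-install; `udevadm settle` and `blockdev --rereadpt` are the real commands. It can only narrow the race window, not close it.

```shell
# Hypothetical helper; not part of coreos-install.
# Runs a command up to N times, waiting for udev's event queue before each try.
retry_after_settle() (
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # SETTLE_CMD is an assumed override hook; the default is udevadm settle.
        ${SETTLE_CMD:-udevadm settle} || true
        if "$@"; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
)

# Example (needs root and a real disk, so it is left commented out):
# retry_after_settle 5 1 blockdev --rereadpt /dev/sda
```

If every attempt fails, something other than a transient udev probe is probably holding the disk open.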

@jumanjiman

FYI: we've seen the same issue and employed the same workaround as @jl-montes.

@wdennis

wdennis commented Nov 20, 2014

Same problem experienced tonight (v444.4.0) on a bare-metal install from CD; I booted again and retried the install, and it happened again. Funny thing is, after that I booted from disk to see what would happen, and CoreOS was actually installed; it just didn't pick up my cloud-init (which was on a path mounted from a USB drive).

@ChrisGuilbault

I ran into this same issue repeatedly. I posted to the Google group, but I hit this issue 19 out of 20 times, even with rebooting.

However, I found a second workaround that does not require a reboot.

After receiving the BLKRRPART error, I unmounted the drive I was installing to, fixed the GPT to span the entire drive, and ran the install script again. This worked 100% of the time. I tested it on three machines on which I had received the BLKRRPART or segfault errors while trying to install to disk.

@cimnine

cimnine commented Jan 9, 2015

Same for me as for @wdennis: the script failed with BLKRRPART, but CoreOS was actually installed and it booted. To be sure everything was all right, I shut down the installed CoreOS instance, booted ISOLINUX again, and re-ran the install script. This time: Success!

Some more info that might be helpful:
I have an Intel NUC. The disk is an SSD.
The disk had a multi-partition layout before. (XBMCbuntu was previously installed with the 'default' XBMCbuntu partition layout. Not that I expect XBMCbuntu to be the cause of the trouble, but it might help you get an idea of the previously configured partition layout.)

johscheuer added a commit to johscheuer/theforeman-coreos-kubernetes that referenced this issue Feb 23, 2015
@apscomp

apscomp commented Mar 18, 2015

This error also happens when installing via ISO (CD-ROM). However, if you actually eject the CD and reboot, you will find that it has indeed installed CoreOS.

@domq

domq commented Mar 19, 2015

It seems unlikely that udev is the culprit: I just tried killing it until systemd gave up restarting it, and still

bash-4.2# sfdisk -R /dev/sda
sfdisk: BLKRRPART: Device or resource busy
sfdisk: This disk is currently in use.

Using CoreOS stable (CoreOS 607.0.0), also PXE-booted. I am at a loss to determine what is actually keeping the partitions busy, but since I can reproduce this at will, please feel free to suggest commands to try.

@marineam

Interesting, so there must be two races going on. There is a race with
udev, it uses inotify to detect if the full-disk node has been written to
and will then issue the same ioctl to have the kernel reprobe the device
and then as it receives the uevent back from the kernel it will recreate
/dev/disk/by-* symlinks so if you mount by label or similar too quickly it
can fail due to the missing link. We observe this while building CoreOS
images from time to time.

The explicit reprobe in this script is there because only relatively recent
versions of udev have this automatic behavior and even in versions that do
there isn't a reliable way to wait for udev to do its magic. I previously
assumed that the busy errors were due to our reprobe happening while udev
was working but if that is not the case we have a whole new mystery on our
hands.

@cybertk

cybertk commented May 20, 2015

Encountered this issue on 633.1.0. Any update?

@threedliteguy

I had the same error installing from ISO. From USB I got an Input/Output Error. I pulled up GParted from the Ubuntu live CD, and it fixed errors in the GPT layout, which was all small partitions (I was installing to a 1TB HD). Reran twice, same error.

@scari

scari commented Jul 8, 2015

Same error here. PXE booted and installing from the stable channel. I followed the same workaround as @jl-montes, but the success rate is way lower.

@tdeheurles

Same issue installing with 745.1.0.

I had the issue installing from an Ubuntu key, and then it worked.

Now I am trying to reinstall from inside the running CoreOS.
Can someone confirm that the coreos-install command can be executed from the running OS?

@stresler

stresler commented Aug 7, 2015

We ran into this several times, and I think I found the culprit in at least one scenario.

We kept encountering this with client provisions and couldn't reproduce in lab until we used the client's exact userdata, and then it was intermittent.

When cloud-init contains anything that involves a download it seems to lock the device. The weird part is it does this even if the disk doesn't have a filesystem on it at all.

I suspect docker or something else in third-party expects a disk to be present and if it isn't it picks what it thinks should be the first one /dev/sda and tries to use it.

For our installer image, I removed any access to userdata until the second boot and haven't encountered it again yet. Will update if I see it again.

@domq

domq commented Aug 7, 2015

So in the case of our farm of bare-metal boxes, I think that the issue was pre-existing LVM volumes.

Zapping them with vgremove prior to running coreos-install solves the issue for me. Teaching coreos-install to do the same could be worthwhile, although slightly trickier for multi-disk systems.
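
For the multi-disk case, one possible approach is to act only on volume groups that have a physical volume on the target disk. The sketch below is an illustration, not something coreos-install does: `vgs_on_disk` is a hypothetical name, and it only parses the output of `pvs --noheadings -o pv_name,vg_name` passed on stdin. It assumes partition names of the form `/dev/sda2` or `/dev/nvme0n1p2`.

```shell
# Hypothetical helper; not part of coreos-install.
# Reads "PV VG" pairs on stdin and prints the VGs with a PV on the given disk.
vgs_on_disk() {
    awk -v disk="$1" '
        # A PV may be the whole disk or one of its partitions.
        NF >= 2 && ($1 == disk || (index($1, disk) == 1 &&
                    substr($1, length(disk) + 1) ~ /^p?[0-9]+$/)) {
            print $2
        }' | sort -u
}

# Example (needs root and LVM tools, so it is left commented out).
# vgchange -an only deactivates; vgremove would destroy the volume group.
# pvs --noheadings -o pv_name,vg_name | vgs_on_disk /dev/sda | xargs -r -n1 vgchange -an
```

Deactivating rather than removing is the more conservative choice if the old data might still be needed.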

@blemmenes

I can confirm using vgremove as @domq mentioned just worked for me on baremetal booted via IPMI mounted ISO.

@Evidlo

Evidlo commented Nov 10, 2015

I had this issue, too. The disk used to be part of a RAID.

I went ahead and reformatted the disk with fdisk and it installed successfully. Not sure if this is actually the cause of the success though.

@Wintereise

Just ran into this; the device used to be part of an LVM volume group.

Will remove metadata manually and see if it works.

@wgzhao

wgzhao commented Nov 25, 2015

I had this issue, too.
I booted the SystemRescueCd live OS, then ran coreos-install /dev/sda, but it gave me
BLKRRPART: Device or resource busy
I commented out
blockdev --rereadpt "${DEVICE}"

at about line 303

and it works.

@stresler

It seems like this issue is getting a bit muddied. Unfortunately the BLKRRPART error can be caused by a variety of reasons and sometimes is legitimate reporting (device is actually busy).

I think for this issue to remain valid we need to create specific reproduction steps and treat each verified set of reproduction steps as a new issue. The issue we had has been resolved (see this comment).

Does anyone have specific reproduction steps? Otherwise I suggest archiving this for reference and treating new instances as new issues with a focus on tracking it down per hardware setup.

Does that sound reasonable?

@gbock

gbock commented Dec 3, 2015

One case is definitely active LVM Volume groups on DEVICE from a previous install of another OS (my use case is having to install CoreOS via grub from a previous CentOS install).

Pre Install state:

localhost ~ # parted -s /dev/sda print
Model: ATA CentOS to CoreOS (scsi)
Disk /dev/sda: 68.7GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Disk Flags: 

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  525MB   524MB   primary  ext4         boot
 2      525MB   68.7GB  68.2GB  primary               lvm

localhost ~ # lsblk 
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                    8:0    0    64G  0 disk 
|-sda1                 8:1    0   500M  0 part 
`-sda2                 8:2    0  63.5G  0 part 
  |-VolGroup-lv_home 254:0    0    31G  0 lvm  
  |-VolGroup-lv_root 254:1    0  30.5G  0 lvm  
  `-VolGroup-lv_swap 254:2    0     2G  0 lvm  
sr0                   11:0    1  96.4M  0 rom  
sr1                   11:1    1   395M  0 rom  
loop0                  7:0    0 182.1M  0 loop /usr

Post Install state:

localhost ~ # parted -s /dev/sda print
Warning: Not all of the space available to /dev/sda appears to be used, you can fix the GPT to use all of the space (an extra 124928000 blocks) or continue with the current setting? 
Model: ATA CentOS to CoreOS (scsi)
Disk /dev/sda: 68.7GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: pmbr_boot

Number  Start   End     Size    File system  Name        Flags
 1      2097kB  136MB   134MB   fat16        EFI-SYSTEM  boot, legacy_boot, esp
 2      136MB   138MB   2097kB               BIOS-BOOT   bios_grub
 3      138MB   1212MB  1074MB  ext2         USR-A
 4      1212MB  2286MB  1074MB               USR-B
 6      2286MB  2420MB  134MB   ext4         OEM
 7      2420MB  2487MB  67.1MB               OEM-CONFIG
 9      2487MB  4754MB  2267MB  ext4         ROOT

localhost ~ # lsblk 
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                    8:0    0    64G  0 disk 
|-sda1                 8:1    0   500M  0 part 
`-sda2                 8:2    0  63.5G  0 part 
  |-VolGroup-lv_home 254:0    0    31G  0 lvm  
  |-VolGroup-lv_root 254:1    0  30.5G  0 lvm  
  `-VolGroup-lv_swap 254:2    0     2G  0 lvm  
sr0                   11:0    1  96.4M  0 rom  
sr1                   11:1    1   395M  0 rom  
loop0                  7:0    0 182.1M  0 loop /usr

These should be set inactive before the image is written to DEVICE:

--- /usr/bin/coreos-install    2015-12-01 02:01:29.000000000 +0000
+++ /tmp/coreos-install 2015-12-03 16:20:37.535278241 +0000
@@ -284,6 +284,9 @@
 echo "Downloading the signature for ${IMAGE_URL}..."
 wget --inet4-only --no-verbose -O "${WORKDIR}/${SIG_NAME}" "${SIG_URL}"

+# Deactivate any LVM volume groups on DEVICE
+vgs -a -o +devices | awk "/ ${DEVICE//\//\/}[0-9]/{print \$1}" | sort -u | xargs -n1 vgchange -an
+
 echo "Downloading, writing and verifying ${IMAGE_NAME}..."
 declare -a EEND
 if ! wget --inet4-only --no-verbose -O - "${IMAGE_URL}" \

I'm using this unit for now in my oem cloud config (a bit more of a hammer):

coreos:
  units:
    - name: deactivate-lvm.service
      command: start
      content: |
        [Service]
        Type=oneshot
        ExecStart=/bin/sh -c '/usr/sbin/vgs --noheadings -o vg_name | /usr/bin/xargs -n1 /usr/sbin/vgchange -an'

@coconutpilot

I can reproduce this quite easily PXE booting on a test cluster of Dell Optiplex 960 towers (desktop class machines) and running coreos-install. I wonder how this lower spec of hardware comes into play ... (slow disks, unsafe caching, etc).

This is an example of a failed install:

core@fox3 ~ $ sudo coreos-install -C alpha -V current -d /dev/sda -c fox3.yaml 
Checking availability of "local-file"
Fetching user-data from datasource of type "local-file"
Downloading the signature for http://alpha.release.core-os.net/amd64-usr/current/coreos_production_image.bin.bz2...
2015-12-07 23:15:29 URL:http://alpha.release.core-os.net/amd64-usr/current/coreos_production_image.bin.bz2.sig [543/543] -> "/tmp/coreos-install.RsziI7s9km/coreos_production_image.bin.bz2.sig" [1]
Downloading, writing and verifying coreos_production_image.bin.bz2...
2015-12-07 23:16:13 URL:http://alpha.release.core-os.net/amd64-usr/current/coreos_production_image.bin.bz2 [231125324/231125324] -> "-" [1]
gpg: Signature made Thu Dec  3 02:51:35 2015 UTC using RSA key ID 1CB5FA26
gpg: key 93D2DCB4 marked as ultimately trusted
gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0  valid:   1  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 1u
gpg: Good signature from "CoreOS Buildbot (Offical Builds) <buildbot@coreos.com>" [ultimate]
blockdev: ioctl error on BLKRRPART: Device or resource busy

Both sync and blockdev --flushbufs work around this problem (for me at least).

$ diff -u /usr/bin/coreos-install /tmp/coreos-install 
--- /usr/bin/coreos-install     2015-11-05 02:11:46.000000000 +0000
+++ /tmp/coreos-install 2015-12-07 23:11:29.251420878 +0000
@@ -300,7 +300,7 @@
 fi

 # inform the OS of partition table changes
-blockdev --rereadpt "${DEVICE}"
+blockdev --flushbufs --rereadpt "${DEVICE}"

 if [[ -n "${CLOUDINIT}" ]] || [[ -n "${COPY_NET}" ]]; then
     # The ROOT partition should be #9 but make no assumptions here!

@crawford
Contributor

crawford commented Dec 8, 2015

@coconutpilot neat! Would you mind filing a PR against https://github.com/coreos/init/blob/master/bin/coreos-install?

@jumanjiman

@coconutpilot neat! Would you mind filing a PR against

👍

@coconutpilot

Before I submitted my pull request I wanted to do some more testing. A simpler test case of the bug is (may need to run a few times):

blockdev --rereadpt /dev/sda & blockdev --rereadpt /dev/sda & wait

Looking at the kernel ioctl: http://lxr.free-electrons.com/source/block/ioctl.c#L184

It's wrapped in a mutex, so I am at a loss as to why this is happening.

As noted by @marineam in the first comment to the issue report, it seems to be a race with udev. This is where udev does BLKRRPART:

https://github.com/systemd/systemd/blob/564c44436cf64adc7a9eff8c17f386899194a893/src/udev/udevd.c#L1043

This means my proposed fix blockdev --flushbufs only worked because it gave enough time for the rescan called by udevd to complete. Additionally @marineam had it figured out from the start.

A proposal for a solution:

  • add a pre-flight check that the install device has no mounted partitions
  • remove the call to blockdev --rereadpt as udevd already effectively makes the same ioctl.
  • add a sleep 1 between writing the filesystem to /dev/sdX and mounting. Perhaps udevadm settle could be used here, but docs focus on using that command at boot time.
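
The first bullet above (a pre-flight check for mounted partitions) could be sketched roughly as follows. The function names are hypothetical and this is not the actual coreos-install code. It assumes `/proc/mounts` lists partitions by their kernel names such as `/dev/sda9`; mounts made through `/dev/mapper` or `/dev/disk/by-*` paths would not be caught.

```shell
# Hypothetical pre-flight check; not part of coreos-install.
# Prints "device mountpoint" for each mounted partition of the given disk.
mounted_partitions() {
    # $1: whole-disk node, e.g. /dev/sda; $2: mounts table (default /proc/mounts)
    awk -v disk="$1" '
        $1 == disk || (index($1, disk) == 1 &&
                       substr($1, length(disk) + 1) ~ /^p?[0-9]+$/) {
            print $1, $2
        }' "${2:-/proc/mounts}"
}

# Returns non-zero and explains why if anything on the disk is mounted.
preflight_check() (
    busy=$(mounted_partitions "$@")
    if [ -n "$busy" ]; then
        echo "Refusing to install: mounted partitions on $1:" >&2
        echo "$busy" >&2
        return 1
    fi
    return 0
)
```

An installer could call `preflight_check "${DEVICE}" || exit 1` before writing the image, so the failure is reported before the disk is touched rather than after.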

Does this sound sane?

@danilochilene

I got the same issue.

Only worked after deleting the VG.

@jotasixto

When I tried to install CoreOS stable (835.12.0) to disk on a Xen 3.16 PV guest, I got the BLKRRPART error.
My mistake was that the LVM volume had a malformed name: I had appended a number to the volume label, and I think this created a conflict in which the volume was interpreted as a partition.
I renamed the volume label so that it has no number at the end, and it works fine.

@thereallukl

I just hit this while installing CoreOS on machines that previously had CentOS installed. For me, removing the DM mappings prior to running the installer fixes the issue:

dmsetup remove centos_cnlvr01r07s2-root
dmsetup remove centos_cnlvr01r07s2-home
dmsetup remove centos_cnlvr01r07s2-swap

centos_blah_blah names can be listed from /dev/mapper/*.

After that I can write CoreOS to /dev/sda
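
To find which mappings sit on a particular disk rather than listing everything in /dev/mapper, one option is to walk the `holders` directories in sysfs. A rough sketch, with the hypothetical name `dm_holders`; the sysfs root is a parameter only so the function can be exercised against a fake tree.

```shell
# Hypothetical helper; not part of coreos-install.
# Prints the device-mapper names that hold the given disk or its partitions.
dm_holders() (
    # $1: kernel disk name without /dev/, e.g. sda; $2: sysfs root (default /sys)
    sys=${2:-/sys}
    for holder in "$sys/block/$1"/holders/* "$sys/block/$1"/*/holders/*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [ -e "$holder" ] || continue
        name_file="$sys/block/$(basename "$holder")/dm/name"
        if [ -r "$name_file" ]; then
            cat "$name_file"
        fi
    done
)

# Example (destructive, needs root, so it is left commented out):
# dm_holders sda | xargs -r -n1 dmsetup remove
```

Holders that are not device-mapper devices (for example md arrays) have no `dm/name` file and are silently skipped by this sketch.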

The-42 pushed a commit to avionic-design/pbs-platform-avionic-design that referenced this issue May 10, 2016
The sfdisk tool used for partitioning SD-cards has, especially in an
older 2.23.2 version on CentOS 7, problems re-reading the partitions it
just created. As even the --force parameter does not prevent sfdisk from
grinding to a halt, use --no-reread instead. This avoids races with
udev, as stated in numerous bug reports:

https://bugs.launchpad.net/ubuntu/+source/util-linux/+bug/942788
coreos/bugs#152
https://bugs.centos.org/view.php?id=986

Change-Id: I09c4a90c99e324abb8469d6bad1465713d7c8b32
Signed-off-by: Bert van Hall <bert.vanhall@avionic-design.de>
Reviewed-on: http://review.adnet.avionic-design.de/5446
Reviewed-by: Dirk Leber <dirk.leber@avionic-design.de>
@ivarec

ivarec commented Sep 8, 2016

Just bumped into this issue on a machine that was previously a RAID-1 setup. Got it to work after trashing both disks' MBRs and rebooting:

BEWARE: this will trash all data on those disks!

dd if=/dev/zero of=/dev/sda bs=512 count=1
dd if=/dev/zero of=/dev/sdb bs=512 count=1
reboot

@crawford
Contributor

crawford commented Sep 8, 2016

@haolez which version of CoreOS was being used to run coreos-install?

@ivarec

ivarec commented Sep 8, 2016

@crawford 1122.2.0

EDIT Ah, sorry! I was using 1122.2.0 to run coreos-install. The one mentioned previously was what I was trying to install locally.

(Not a production environment)

@pfischer8989

This still happens on 1153.0.0 as well. Fixed it with @haolez's dd trick.

@crawford
Contributor

@haolez interesting. coreos-install dd's over the MBR of the target disk, so one of those calls you mentioned shouldn't be necessary. Is the firmware booting from the wrong device? What is the behavior you are seeing?

@crawford
Contributor

Closing due to inactivity.

@honeyankit

Hi,

I am getting the below error while doing a pxeboot installation of CoreOS v1122.3.0 in virtualbox:

What we notice is that the coreos-install will fail and the last error indicates BLKRRPART: Device or resource busy

I tried a number of the suggestions mentioned above, but none of them solved this problem.
Any help?

@Merlin83b

Merlin83b commented Feb 27, 2017

Experienced this on current stable (1235.12.0). The storage was previously set up with LVM. As @wdennis stated, the install had actually completed and on reboot came up to CoreOS. The contents of my cloud-config.yaml hadn't been applied so had to reinstall again. The second time round there were no errors and cloud-config.yaml settings were applied.

@ghost

ghost commented Dec 8, 2017

I was installing CoreOS 1576.4.0 on a machine that had Ubuntu installed with LVM active. I experienced the same thing, so I tried to reproduce it on a different machine that also had LVM set up, with a machine with no LVM from a previous install as the negative control. This was completely reproducible in 4 attempts. For the fourth attempt, I dd-ed an LVM Linux system onto a machine before doing the CoreOS install. It installs, but without the cloud-config details, so I reboot and re-install, and that clears it up.

I was installing after booting into a live USB of Ubuntu, and I tried it with a live USB of CentOS as well. I wonder what is holding on to the disk. I will try later this weekend (we are getting a snow storm, so not much to do out there) with the LVM deactivation added to my install script.

@bgilbert
Contributor

bgilbert commented Dec 9, 2017

It sounds as though the primary culprits are open LVM PVs and RAID volumes. coreos-install shouldn't automatically close them, but it could check for this case by running blockdev --rereadpt early (followed by udevadm settle) and refuse to proceed with installation if that fails.

An ambitious implementation could also check whether the disk has open LVM PVs (pvs or similar?), RAID volumes (/proc/mdstat), mounted filesystems (/proc/mounts), or swap (/proc/swaps) and provide remediation instructions if found.
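
Part of that check can be sketched as a parser over /proc/swaps and /proc/mdstat that prints a remediation hint per finding. The function name `disk_blockers` is hypothetical, the file arguments exist only so the parsing can be tested, and the sketch assumes the usual layouts of those two files. LVM and mounted filesystems are left out to keep it short.

```shell
# Hypothetical pre-install probe; not part of coreos-install.
# Prints one remediation hint per blocker found on the disk, nothing otherwise.
disk_blockers() (
    disk=$1                      # kernel name without /dev/, e.g. sda
    swaps=${2:-/proc/swaps}      # overridable so the parser can be tested
    mdstat=${3:-/proc/mdstat}

    # Active swap: column 1 is the device path; line 1 is a header.
    awk -v d="/dev/$disk" '
        NR > 1 && ($1 == d || (index($1, d) == 1 &&
                   substr($1, length(d) + 1) ~ /^p?[0-9]+$/)) {
            print "swap on " $1 ": try swapoff " $1
        }' "$swaps"

    # /proc/mdstat only exists when the md driver is loaded.
    [ -r "$mdstat" ] || return 0

    # MD RAID: array lines start with mdN; members look like sda1[0].
    awk -v d="$disk" '
        /^md/ {
            for (i = 3; i <= NF; i++) {
                if ($i !~ /\[[0-9]+\]/) continue
                m = $i
                sub(/\[.*/, "", m)
                if (m == d || (index(m, d) == 1 &&
                    substr(m, length(d) + 1) ~ /^p?[0-9]+$/))
                    print m " in " $1 ": try mdadm --stop /dev/" $1
            }
        }' "$mdstat"
)
```

An installer following the suggestion above would print these hints and refuse to proceed, rather than run swapoff or mdadm itself.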

@shivarammysore

I did a disk install of CoreOS and I see the exact same issue.

blockdev: ioctl error on BLKRRPART: Device or resource busy
Failed to reread partition on /dev/sda

Here are the steps that I took:

  1. Download ISO image and burn it on a CD
  2. Boot CD and login
  3. Run command: coreos-install -d /dev/sda -i ignition.json
  4. ignition.json is similar to https://gist.github.com/shivarammysore/28d2d5fe520805451a5ff47ed8f0dfe4 with the complete RSA key.

@crawford
Contributor

I just ran into this, and in my case it was because the disk already had Container Linux installed. When I booted the ISO image, it mounted /dev/sda3 and /dev/sda9 to /usr and / respectively. I had to use dd to destroy both the primary and secondary GPT tables.
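
For reference, a dry-run sketch of where the two GPT copies live. It only prints dd commands and runs nothing. The function name `gpt_wipe_plan` is hypothetical, and the arithmetic assumes 512-byte logical sectors and the default 128-entry partition table (32 sectors of entries plus one header sector at each end of the disk). It is not necessarily the exact sequence used above.

```shell
# Hypothetical dry-run helper: prints dd commands instead of running them.
# Assumes 512-byte logical sectors and the default 128-entry GPT layout.
gpt_wipe_plan() {
    dev=$1
    sectors=$2    # disk size in 512-byte sectors, e.g. from: blockdev --getsz "$dev"
    # Protective MBR (LBA 0) + primary header (LBA 1) + entries (LBA 2-33).
    echo "dd if=/dev/zero of=$dev bs=512 count=34"
    # Backup entries (32 sectors) + backup header (last sector).
    echo "dd if=/dev/zero of=$dev bs=512 count=33 seek=$((sectors - 33))"
}

# Example: review the plan before running anything by hand.
# gpt_wipe_plan /dev/sda "$(blockdev --getsz /dev/sda)"
```

Where gdisk is available, `sgdisk --zap-all` is meant to clear the same structures without manual offset arithmetic.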

@leonardochaia

This is still happening and I can't figure out the fix from this issue, can someone point me to a workaround?

$ sudo coreos-install -d /dev/sdb -i ignition.json
Current version of CoreOS Container Linux stable is 2191.5.0
Downloading the signature for https://stable.release.core-os.net/amd64-usr/2191.5.0/coreos_production_image.bin.bz2...
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
2019-09-17 17:40:36 URL:https://stable.release.core-os.net/amd64-usr/2191.5.0/coreos_production_image.bin.bz2.sig [566/566] -> "/tmp/coreos-install.ILrflXrixd/coreos_production_image.bin.bz2.sig" [1]
Downloading, writing and verifying coreos_production_image.bin.bz2...
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
^[
2019-09-17 17:48:10 URL:https://stable.release.core-os.net/amd64-usr/2191.5.0/coreos_production_image.bin.bz2 [481116178/481116178] -> "-" [1]
gpg: Signature made mié 04 sep 2019 01:27:14 -03
gpg:                using RSA key FD986FB096482F906F55B2EA01C9CAE767B3CA0E
gpg: key 50E0885593D2DCB4 marked as ultimately trusted
gpg: checking the trustdb
gpg: marginals needed: 3  completes needed: 1  trust model: pgp
gpg: depth: 0  valid:   1  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 1u
gpg: Good signature from "CoreOS Buildbot (Offical Builds) <buildbot@coreos.com>" [ultimate]
blockdev: ioctl error on BLKRRPART: Device or resource busy
Failed to reread partitions on /dev/sdb
blockdev: ioctl error on BLKRRPART: Device or resource busy
Failed to reread partitions on /dev/sdb
blockdev: ioctl error on BLKRRPART: Device or resource busy
Failed to reread partitions on /dev/sdb
blockdev: ioctl error on BLKRRPART: Device or resource busy
Failed to reread partitions on /dev/sdb

@shivarammysore

I am using Fedora CoreOS and it works much better (https://github.com/coreos/coreos-installer/). Download images from https://getfedora.org/coreos/download/ and, if you have issues, the mailing lists are very responsive.

@leonardochaia

I managed to 'fix' the install on my thumb drive using @ivarec's solution:

dd if=/dev/zero of=/dev/sda bs=512 count=1

If we're gonna wipe the device anyway, maybe the coreos-install script could do this for us?

@Cytrian

Cytrian commented May 14, 2020

We had a similar problem with Flatcar. After some investigation I found that udevadm settle triggered a filesystem check.
We masked the systemd-fsck@.service before starting the installer.
systemctl mask systemd-fsck@.service
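
A small wrapper can make the mask temporary, so the unit is unmasked again even if the installer fails. This is a sketch: `with_fsck_masked` and the `SYSTEMCTL` override are hypothetical names, and only `systemctl mask` and `systemctl unmask` are real commands.

```shell
# Hypothetical wrapper; not part of any installer.
# Masks systemd-fsck@.service, runs the given command, then unmasks again.
with_fsck_masked() (
    ctl=${SYSTEMCTL:-systemctl}   # assumed override hook, useful for dry runs
    "$ctl" mask systemd-fsck@.service
    # The EXIT trap fires when this subshell ends, on success or failure.
    trap '"$ctl" unmask systemd-fsck@.service' EXIT
    rc=0
    "$@" || rc=$?
    return "$rc"
)

# Example (needs root, so it is left commented out):
# with_fsck_masked flatcar-install -d /dev/sda -i ignition.json
```

Setting `SYSTEMCTL=echo` shows the sequence of calls without touching the system.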
