Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

btrfs-progs: cmds/qgroup: use sysfs to detect inconsistent status early #952

Open
wants to merge 27 commits into
base: devel
Choose a base branch
from

Conversation

adam900710
Copy link
Collaborator

Currently if "btrfs qgroup show" detects the qgroup is in an inconsistent, it will output a warning:

WARNING: qgroup data inconsistent, rescan recommended

But the detection is based on the tree search ioctl, and qgroup tree is only updated at transaction commit time.

This means if qgroup is marked inconsistent, and the transaction is not commit, there can be a huge window as long as 30s before "btrfs qgroup show" to give a proper warning about inconsistent qgroup numbers.

To address this window, use the
'/sys/fs/btrfs//qgroup/inconsistent' file to do extra check.

That file is updated at real time, thus there is no delay, and can give an early warning about inconsistent qgroup numbers.

Link: https://bugzilla.suse.com/show_bug.cgi?id=1235765

kdave and others added 27 commits February 11, 2025 20:43
Unify the error message style a bit, print the relevant path if missing.

Signed-off-by: David Sterba <dsterba@suse.com>
The preferred error message should have a prefix with problem
description and then the errno description as we use the negative errno
convention almost everywhere.

- drop additional %d in the message if %m is present
- replace %d with %m

Signed-off-by: David Sterba <dsterba@suse.com>
The subvolumes created during mkfs are not printed in the summary
because btrfs_mkfs_fill_dir() deletes them from the list as they get
created.

Signed-off-by: David Sterba <dsterba@suse.com>
Print indicators in the summary if the subvolume is read-write,
read-only or default:

$ mkfs.btrfs --subvol ro:subvolro --subvol rw:subvolrw --subvol default-ro:defaultro --rootdir /rootdir/path img
...
Rootdir from:       /rootdir/path
  Compress:         no
  Subvolume (rw):   subvolrw
  Subvolume (ro):   subvolro
  Subvolume (dro):  defaultro
...

The path is relative to the rootdir path and may not be a subvolume in
the source directory so drop the rootdir as this may be confusing.

Signed-off-by: David Sterba <dsterba@suse.com>
Fix missing or wrong formatting, enhance descriptions or add more
references.

Signed-off-by: David Sterba <dsterba@suse.com>
Docker warns about CMD syntax, which is harmless in our case but let's
fix the warning by using the json (quoted strings in array) syntax.

  JSONArgsRecommended: JSON arguments recommended for CMD to prevent unintended behavior related to OS signals (line 29)

Explained in https://docs.docker.com/reference/build-checks/json-args-recommended .

Signed-off-by: David Sterba <dsterba@suse.com>
Enable CodeQL code scanning checks. For that the automatically detected
build steps don't work as the documentation is expected to build. Add
the manual steps.

Signed-off-by: David Sterba <dsterba@suse.com>
[ci skip]

Signed-off-by: David Sterba <dsterba@suse.com>
The values of UNITS_ modes are supposed to be unique and
non-overlapping. Due to wrong definition this was not true for
UNITS_DECIMAL which set 2 bits. This breaks code that strictly depends
on the uniqueness of the bits and not on the order of processing of the
constants.

Signed-off-by: David Sterba <dsterba@suse.com>
There's a special case when scrub rate is printed, the device sizes and
rate use different units. The size can be in terabytes while the rate
can be in hundreds of megabytes (contemporary 10-20T disks, 250MB/s).

The sizes use what is set on command line (or human readable by
default), while the rate is always human readable with exception to
the option --raw to provide a way to print the exact numbers without any
conversions.

This got broken in commit ec3c842 ("btrfs-progs: scrub status:
with --si, show rate in metric units") that forced the command line mode
to the rate as well.

Instead of that we need to detect the SI/IEC mode and set it to the
human readable format of rate.

Signed-off-by: David Sterba <dsterba@suse.com>
Add currently implemented types and update recommendations.

Signed-off-by: David Sterba <dsterba@suse.com>
Thanks to hunspell.

[ci skip]

Signed-off-by: David Sterba <dsterba@suse.com>
Add sections for each directory so it's more visible where the sections
start and end. Update formatting, enhance descriptions.

Signed-off-by: David Sterba <dsterba@suse.com>
The references used in :doc: need to reference relative path of the
document, otherwise this leads to a warning.

  btrfs-progs/Documentation/Send-receive.rst:19: WARNING: unknown document: 'dev-send-stream' [ref.doc]
  btrfs-progs/Documentation/dev/CmdLineConventions.rst:60: WARNING: unknown document: 'btrfstune' [ref.doc]

Signed-off-by: David Sterba <dsterba@suse.com>
The kernel commit 08fe4db170b419 ("Btrfs: Fix uninitialized root flags
for subvolumes") from 2011 sets the flag BTRFS_INODE_ROOT_ITEM_INIT on
root items, to work around a bug where flags and byte_limit weren't
being set.

Copy this behaviour in mkfs, to prevent the kernel from having to do it
on the first mount. We memset the btrfs_root_item, so there's no
corruption issue as there once was. We already do this in
btrfs_make_subvolume(), as otherwise the readonly flag of any subvolumes
would get reset.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kernel adds a zeroed btrfs_dev_stats_item for each device on the
first mount. Preempt this by doing it at mkfs time.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Follow the kernel by setting the BIG_METADATA incompat flag if nodesize
is greater than the page size.

This flag was introduced with commit 727011e07cbdf8 ("Btrfs: allow
metadata blocks larger than the page size") in 2010, as kernels before
2.6.36 would crash due to a buggy page cache implementation.
The flag has no real meaning anymore but we can at least set it at mkfs
time.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just like all qgroup functions, if a qgroup is marked inconsistent, limit
will not work as expected.  In fact with recent kernels, limit and
qgroup number updating will be fully skipped if qgroup is already
inconsistent.

Add one extra note on `btrfs qgroup limit` subcommand for it.

Link: https://bugzilla.suse.com/show_bug.cgi?id=1235765
Reported-by: Vojtech Lacina <vlacina@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
…wer redundancy

[BUG]
There is a bug report that when deleting a device using sysfs
/sys/block/<dev>/device/delete, the kernel module will still try to read
and write the device.

Normally it's fine as long as all chunks can tolerate that removed
device (e.g. all RAID1).

But the problem is when one is trying to lower the redundancy by
converting to another profile:

  # mkfs.btrfs -f -m raid1 -d raid1 /dev/sdd /dev/sde
  # mount /dev/sdd /mnt
  # echo 1 > /sys/block/sde/device/delete
  # btrfs balance start --force -mdup -dsingle /mnt

This will lead to the filesystem mounted RO, with the following error messages:

  sd 6:0:0:0: [sde] Synchronizing SCSI cache
  ata7.00: Entering standby power mode
  btrfs: attempt to access beyond end of device
  sde: rw=6145, sector=21696, nr_sectors = 32 limit=0
  btrfs: attempt to access beyond end of device
  sde: rw=6145, sector=21728, nr_sectors = 32 limit=0
  btrfs: attempt to access beyond end of device
  sde: rw=6145, sector=21760, nr_sectors = 32 limit=0
  BTRFS error (device sdd): bdev /dev/sde errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
  BTRFS error (device sdd): bdev /dev/sde errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
  BTRFS error (device sdd): bdev /dev/sde errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
  BTRFS error (device sdd): bdev /dev/sde errs: wr 3, rd 0, flush 1, corrupt 0, gen 0
  btrfs: attempt to access beyond end of device
  sde: rw=145409, sector=128, nr_sectors = 8 limit=0
  BTRFS warning (device sdd): lost super block write due to IO error on /dev/sde (-5)
  BTRFS error (device sdd): bdev /dev/sde errs: wr 4, rd 0, flush 1, corrupt 0, gen 0
  btrfs: attempt to access beyond end of device
  sde: rw=14337, sector=131072, nr_sectors = 8 limit=0
  BTRFS warning (device sdd): lost super block write due to IO error on /dev/sde (-5)
  BTRFS error (device sdd): bdev /dev/sde errs: wr 5, rd 0, flush 1, corrupt 0, gen 0
  BTRFS error (device sdd): error writing primary super block to device 2
  BTRFS info (device sdd): balance: start -dconvert=single -mconvert=dup -sconvert=dup
  BTRFS info (device sdd): relocating block group 1372585984 flags data|raid1
  BTRFS error (device sdd): bdev /dev/sde errs: wr 5, rd 0, flush 2, corrupt 0, gen 0
  BTRFS warning (device sdd): chunk 2446327808 missing 1 devices, max tolerance is 0 for writable mount
  BTRFS: error (device sdd) in write_all_supers:4044: errno=-5 IO failure (errors while submitting device barriers.)
  BTRFS info (device sdd state E): forced readonly
  BTRFS warning (device sdd state E): Skipping commit of aborted transaction.
  BTRFS error (device sdd state EA): Transaction aborted (error -5)
  BTRFS: error (device sdd state EA) in cleanup_transaction:2017: errno=-5 IO failure
  BTRFS info (device sdd state EA): balance: ended with status: -5

[CAUSE]
Btrfs doesn't have any runtime device error handling, it fully rely on
the extra copy provided.

For the sysfs block device removal, normally there is a device shutdown
callback to the running fs, but unfortunately btrfs doesn't support this
callback yet.

Thus even with that device removed, btrfs will still access that
removed device (both read and write, even if they will fail).

Normally for a full RAID1 btrfs, it will still be fine reading/write the
fs as usual.  The proper action is to replace the
removed/missing/failing device with a newer one using `btrfs device
replace`.

But when doing the convert, btrfs will allocate new metadata chunks on
to the removed device (which will lose all writes).

And since the new metadata profile is DUP, which can not handle any
missing device of that metadata chunk, finally it triggers the final
protection at transaction commit time, and flips the filesystem to RO,
before causing any real data loss.

[DOC ENHANCEMENT]
Add a warning to the `convert` filter about the dangerous doing convert
to a lower redundancy profile when there is a known failing/removed
device.

And mention the proper way to handle such failing/missing device.

The root fix is to introduce a failing/removed device detection for
btrfs, but that will be a pretty big feature and will take quite some
time before landing it upstream.

Link: https://lore.kernel.org/linux-btrfs/2cb1d81e-12a8-4fb1-b3fc-e7e83d31e059@siddall.name/
Reported-by: Jeff Siddall <news@siddall.name>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There already is misc/065-csum-conversion-inject .

Signed-off-by: David Sterba <dsterba@suse.com>
The 065 and 066 are taken.

Signed-off-by: David Sterba <dsterba@suse.com>
There are two tests 064, and we're missing 061 so fill the gap.

Signed-off-by: David Sterba <dsterba@suse.com>
Use 'symbolic' as the term.

Signed-off-by: David Sterba <dsterba@suse.com>
The templates need to be renamed, so make it more suggestive by
prepending the expected name of the script test.sh. Also fix the
permissions and make it 755 so it's not missed later as this would not
execute the script.

Signed-off-by: David Sterba <dsterba@suse.com>
Minor fixes and updates that reflect current state.

[ci skip]

Signed-off-by: David Sterba <dsterba@suse.com>
For quick checks before a push of non-code changes we may want to do
only the spellchecking workflow. Any branch pushed matching the prefix
"codespell/" will be picked by this.

Signed-off-by: David Sterba <dsterba@suse.com>
Currently if "btrfs qgroup show" detects the qgroup is in an inconsistent,
it will output a warning:

  WARNING: qgroup data inconsistent, rescan recommended

But the detection is based on the tree search ioctl, and qgroup tree is
only updated at transaction commit time.

This means if qgroup is marked inconsistent, and the transaction is not
commit, there can be a huge window as long as 30s before "btrfs qgroup
show" to give a proper warning about inconsistent qgroup numbers.

To address this window, use the
'/sys/fs/btrfs/<fsid>/qgroup/inconsistent' file to do extra check.

That file is updated at real time, thus there is no delay, and can give
an early warning about inconsistent qgroup numbers.

Link: https://bugzilla.suse.com/show_bug.cgi?id=1235765
Signed-off-by: Qu Wenruo <wqu@suse.com>
@kdave kdave force-pushed the devel branch 3 times, most recently from 1480f4a to 103338d Compare February 17, 2025 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants