
zfs recv occasionally hangs instead of running on recent git #13232

Closed
rincebrain opened this issue Mar 18, 2022 · 4 comments
Labels
Status: Stale (no recent activity for issue), Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

@rincebrain
Contributor

rincebrain commented Mar 18, 2022

System information

Distribution Name: Debian
Distribution Version: 11.2
Kernel Version: 5.10.103-v8+
Architecture: aarch64
OpenZFS Version: 2feba9a

Describe the problem you're observing

(I've seen it happen much more rarely on my faster systems, but it seems to occur more frequently on my Pi 4)
(I'll run this down, but figured I'd open an issue while I was doing so...)

When trying to benchmark a PR recently that didn't touch the send/recv code, I observed that sometimes "zfs recv" just hung indefinitely without ever starting to consume much from stdin. Of course, attaching gdb/strace/etc. just makes zfs recv crash, so it's not easily investigable...

So I tested it, and found that it sometimes happened with stock git, too. The recv command wasn't even interesting, either: zfs recv -o compression={off,lz4} some/pool/path.

Describe how to reproduce the problem

  • Run zfs recv a number of times, preferably on a slow system
  • Wait for an instance where pv/mbuffer/etc. in the middle of the pipeline reports that recv isn't reading any bytes; even if you leave it for hours, it never makes progress and never logs anything to dmesg
  • ...profit?
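The steps above, as a hedged shell sketch: the pool, dataset, and snapshot names are placeholders I made up, and any stream that normally recvs fine should do. pv is only there to watch for the byte counter stalling.

```shell
# Hypothetical repro loop; somepool/src@snap and somepool/dst are placeholders.
for i in $(seq 1 50); do
    zfs send somepool/src@snap \
        | pv \
        | zfs recv -o compression=lz4 somepool/dst/try"$i"
done
# A hung iteration shows pv's counter frozen near zero indefinitely,
# with nothing in dmesg.
```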

Include any warning/errors/backtraces from the system logs

None, really.

When I turned on all the zfs_flags knobs, I got back a couple of:
1647571080 dsl_dir.c:1330:dsl_dir_tempreserve_impl(): error 28
but it's not clear to me how interesting that is.

Edit: if I go kick it into dumping core while "stuck", this is the backtrace of the only thread:

(gdb) thread apply all bt

Thread 1 (Thread 0x7f979c8040 (LWP 3526503)):
#0  ioctl () at ../sysdeps/unix/sysv/linux/aarch64/ioctl.S:26
#1  0x0000007f98043154 in lzc_ioctl_fd (fd=<optimized out>, request=<optimized out>, zc=<optimized out>) at os/linux/libzfs_core_ioctl.c:29
#2  0x0000007f98041edc in recv_impl (snapname=snapname@entry=0x7fe13dd4d0 "pipool/baseline/try15@baseline", recvdprops=<optimized out>, localprops=0x556ed03b90, wkeydata=<optimized out>, wkeylen=85, origin=0x7fe13dbed0 "", origin@entry=0x7fe13e2d98 "", force=(unknown: 0x6ed0c3e0), resumable=B_FALSE, raw=<optimized out>, input_fd=<optimized out>, begin_record=<optimized out>, read_bytes=<optimized out>, errflags=<optimized out>, errors=<optimized out>) at libzfs_core.c:1128
#3  0x0000007f980428ac in lzc_receive_with_cmdprops (snapname=snapname@entry=0x7fe13dd4d0 "pipool/baseline/try15@baseline", props=<optimized out>, cmdprops=<optimized out>, wkeydata=<optimized out>, wkeylen=<optimized out>, origin=origin@entry=0x7fe13e2d98 "", force=<optimized out>, resumable=<optimized out>, raw=B_FALSE, input_fd=0, input_fd@entry=1852143205, begin_record=0x7fe13e2d98, begin_record@entry=0x696c65736162006e, cleanup_fd=<optimized out>, cleanup_fd@entry=-1, read_bytes=<optimized out>, read_bytes@entry=0x7fe13e2bc0, errflags=<optimized out>, errflags@entry=0x7fe13e2bc8, action_handle=action_handle@entry=0x0, errors=0x7fe13dbe48, errors@entry=0x7fe13e2bd8) at libzfs_core.c:1258
#4  0x0000007f98091e90 in zfs_receive_one (hdl=hdl@entry=0x556ed01570, infd=1852143205, infd@entry=0, tosnap=tosnap@entry=0x556ed0c3e0 "pipool/baseline/try15", originsnap=originsnap@entry=0x0, flags=flags@entry=0x7fe13e3418, drr=0x200011, drr@entry=0x7fe13e2c60, drr_noswap=0x696c65736162006e, sendfs=sendfs@entry=0x7fe13e2b60 "evenfaster/haulin", stream_nv=<optimized out>, stream_nv@entry=0x0, stream_avl=<optimized out>, stream_avl@entry=0x0, top_zfs=<optimized out>, top_zfs@entry=0x7fe13e3340, finalsnap=<optimized out>, finalsnap@entry=0x0, cmdprops=<optimized out>, cmdprops@entry=0x556ed0c600) at libzfs_sendrecv.c:4833
#5  0x0000007f980931d0 in zfs_receive_impl (hdl=hdl@entry=0x556ed01570, tosnap=tosnap@entry=0x556ed0c3e0 "pipool/baseline/try15", originsnap=0x0, flags=flags@entry=0x7fe13e3418, infd=infd@entry=0, sendfs=0x7fe13e2b60 "evenfaster/haulin", sendfs@entry=0x0, stream_nv=stream_nv@entry=0x0, stream_avl=stream_avl@entry=0x0, top_zfs=top_zfs@entry=0x7fe13e3340, finalsnap=finalsnap@entry=0x0, cmdprops=cmdprops@entry=0x556ed0c600) at libzfs_sendrecv.c:5304
#6  0x0000007f98094284 in zfs_receive (hdl=0x556ed01570, tosnap=0x556ed0c3e0 "pipool/baseline/try15", props=0x556ed0c600, flags=flags@entry=0x7fe13e3418, infd=infd@entry=0, stream_avl=stream_avl@entry=0x0) at libzfs_sendrecv.c:5346
#7  0x00000055620c1a0c in zfs_do_receive (argc=1, argv=0x556ed0d018) at zfs_main.c:4859
#8  0x00000055620b6b44 in main (argc=6, argv=<optimized out>) at zfs_main.c:8669
@rincebrain rincebrain added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 18, 2022
@rincebrain
Contributor Author

Well, I just reproduced this on zfs-2.0.0-rc1-217-gfd654e412, so much for this being new. I guess I just don't run recvs on really slow setups enough to notice...maybe?

@rincebrain
Contributor Author

...you know, I wasn't really expecting this to be the function to blame, but it appears that we get stuck in:

[<0>] __switch_to+0xd4/0x128
[<0>] pipe_read+0x300/0x3d8
[<0>] __kernel_read+0x114/0x2b0
[<0>] kernel_read+0x5c/0x70
[<0>] zfs_file_read_impl+0x104/0x130 [zfs]
[<0>] zfs_file_read+0xb4/0x180 [zfs]
[<0>] receive_read+0x8c/0x1e0 [zfs]
[<0>] receive_read_payload_and_next_header+0x48/0x280 [zfs]
[<0>] dmu_recv_stream+0x5f0/0xa90 [zfs]
[<0>] zfs_ioc_recv_impl+0x21c/0xd50 [zfs]
[<0>] zfs_ioc_recv+0x17c/0x244 [zfs]
[<0>] zfsdev_ioctl_common+0x6d8/0x830 [zfs]
[<0>] zfsdev_ioctl+0x74/0x100 [zfs]
[<0>] __arm64_sys_ioctl+0xb0/0xf0
[<0>] el0_svc_common.constprop.3+0xc0/0x1f8
[<0>] do_el0_svc+0x2c/0x98
[<0>] el0_svc+0x20/0x30
[<0>] el0_sync_handler+0x90/0xb8
[<0>] el0_sync+0x180/0x1c0

so I guess pipe_read's behavior isn't what we think it should be?
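Not ZFS-specific, but as a minimal sketch of the semantics in play here: a read() on an empty pipe simply parks the caller in the kernel's pipe_read until the writer supplies bytes, which is exactly where the stack above shows the receive thread sitting (kernel_read -> pipe_read). In Python:

```python
import os
import threading
import time

# A read() on an empty pipe blocks in pipe_read until the writer
# produces data -- the same state the kernel stack above shows the
# recv thread stuck in.
r, w = os.pipe()
result = {}

def reader():
    # Parks in the kernel's pipe_read until bytes arrive.
    result["data"] = os.read(r, 4)

t = threading.Thread(target=reader)
t.start()
time.sleep(0.2)
assert t.is_alive()        # no writer activity: reader is still blocked
os.write(w, b"HDR!")       # data arrives, the blocked read completes
t.join()
assert result["data"] == b"HDR!"
```

If the writer side never produces anything (or is itself wedged upstream), that read sits there forever with no kernel log output, which matches the observed hang.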

@rincebrain
Contributor Author

zrepl/zrepl#424 seems extremely relevant.

@stale

stale bot commented Mar 18, 2023

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Mar 18, 2023
@stale stale bot closed this as completed Jun 18, 2023