viable strategy to support systemd #168

Rudd-O · 2011-03-21T18:42:53Z

systemd no longer mounts things by going through /etc/fstab at boot -- rather it implicitly creates filesystem units, which then use the kernel automounter to mount a filesystem when it is first accessed.

I have been thinking about this and perhaps the best strategy would be to create and remove systemd filesystem units as filesystems are discovered / created / removed, so they will be on disk and mounted.

The better advantage of doing this, in addition to parallel mounting of filesystems upon need, is that now we can interleave different types of filesystems and they will be correctly mounted on boot. This case, for example, does not work with our current initscript:

/ ext4
/var zfs
/var/lib ext4

As you can see, / would be mounted by rc.sysinit just fine, but then rc.sysinit woudl fail to mount /var/lib, as the mountpoint does not exist because the /var zfs filesystem has not been mounted and will not be mounted until S01zfs start executes later on the boot sequence. Pretty bad. But with systemd, if we get the units right, we can stop relying on zfs mount -a (which would not work with interleaved filesystem types anyway) and start relying on the kernel automounter to do the work for us.

behlendorf · 2011-03-21T23:13:59Z

Perhaps related to this is the .zfs snapshot directory. I held off implementing this in the first zfs release for a couple of reasons. The most important of which is it's complicated the way it was done under OpenSolaris. Under Linux I have a feeling the right thing to do is leverage the automounter. This still needs to be investigated, but if your digging in to the automounter for systemd you might also consider .zfs snapshots.

Rudd-O · 2011-07-06T11:52:54Z

Unfortunately I have not been able to come up with a credible solution for this issue yet.

I would also like to add that we do not really have a credible poweroff story, since I suspect the same problem afflicts the initscript. WORSE STILL, in the case of zfs-as-rootfs, the pool should ideally be exported or put in a safe state before powering off, but I don't believe we do that right now, and I dunno how to export the pool without removing access to the binaries required to finish the poweroff.

Rudd-O · 2011-07-06T11:54:20Z

oh shi- I closed the bug by accident.

ANYWAY. Systemd automatically reads /etc/fstab to create the proper mount and automount units for everything upon boot, so either we extend systemd to do the same for zfs, or we manually write unit files and try to keep that mess in sync (honestly, very difficult to do).

I would personally choose the first avenue.

Rudd-O · 2011-07-06T11:55:10Z

Of course, systemd's idea of what file systems are available should be updated every time a pool is imported or exported, and every time a filesystem is created / removed.

Rudd-O · 2011-11-12T00:07:28Z

There needs to be fixing for the following bug (essentially support mount -o remount,rw /, which the strace below shows a failure):

~/Projects/Mine/zfs@karen.dragonfear α:
sudo strace -f -eexecve /bin/mount / -o remount
execve("/bin/mount", ["/bin/mount", "/", "-o", "remount"], [/* 21 vars /]) = 0
Process 1397 attached
Process 1396 suspended
[pid 1397] execve("/sbin/mount.zfs", ["/sbin/mount.zfs", "ssd/RPOOL/fedora-16", "/", "-o", "rw,remount,noatime,xattr"], [/ 17 vars /]) = 0
Process 1398 attached
Process 1397 suspended
[pid 1398] execve("/usr/sbin/exportfs", ["exportfs", "-v"], [/ 17 vars */]) = 0
Process 1397 resumed
Process 1398 detached
[pid 1397] --- {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1398, si_status=0, si_utime=0, si_stime=0} (Child exited) ---
filesystem 'ssd/RPOOL/fedora-16' cannot be mounted using 'mount'.
Use 'zfs set mountpoint=legacy' or 'zfs mount ssd/RPOOL/fedora-16'.
See zfs(8) for more information.
Process 1396 resumed
Process 1397 detached
--- {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1397, si_status=1, si_utime=0, si_stime=4} (Child exited) ---

~/Projects/Mine/zfs@karen.dragonfear α:
systemctl show remount-rootfs.service
Id=remount-rootfs.service
Names=remount-rootfs.service
Requires=systemd-stdout-syslog-bridge.socket
WantedBy=local-fs.target
Conflicts=shutdown.target
Before=fedora-readonly.service local-fs.target shutdown.target
After=systemd-readahead-collect.service systemd-readahead-replay.service fsck-root.service systemd-stdout-syslog-bridge.socket
Description=Remount Root FS
LoadState=loaded
ActiveState=failed
SubState=failed
FragmentPath=/lib/systemd/system/remount-rootfs.service
UnitFileState=static
InactiveExitTimestamp=Fri, 11 Nov 2011 05:54:04 -0800
InactiveExitTimestampMonotonic=4694290
ActiveEnterTimestampMonotonic=0
ActiveExitTimestampMonotonic=0
InactiveEnterTimestamp=Fri, 11 Nov 2011 05:54:04 -0800
InactiveEnterTimestampMonotonic=4993656
CanStart=yes
CanStop=yes
CanReload=no
CanIsolate=no
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=no
OnFailureIsolate=no
IgnoreOnIsolate=no
IgnoreOnSnapshot=no
DefaultControlGroup=name=systemd:/system/remount-rootfs.service
ControlGroup=cpu:/system/remount-rootfs.service name=systemd:/system/remount-rootfs.service
NeedDaemonReload=no
JobTimeoutUSec=0
ConditionTimestamp=Fri, 11 Nov 2011 05:54:04 -0800
ConditionTimestampMonotonic=4685445
ConditionResult=yes
Type=oneshot
Restart=no
NotifyAccess=none
RestartUSec=100ms
TimeoutUSec=1min 30s
ExecStart={ path=/bin/mount ; argv[]=/bin/mount / -o remount ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
UMask=0022

Rudd-O · 2011-11-12T00:08:29Z

The TODO items in the pull request exemplify the things we need to have in order to have full systemd support.

ryao · 2013-07-14T12:01:08Z

Gentoo Linux now has systemd support in its ZFS ebuilds. The Sabayon Linux developers did the prerequisite work for this.

https://bugs.gentoo.org/show_bug.cgi?id=475872

@Rudd-O I know that you are a fan of systemd support. There is an opportunity here to do some work to get this merged into ZFSOnLinux. I do not plan to do this myself because I do not use systemd.

Rudd-O · 2013-07-22T09:00:05Z

Replied. Their systemd work is a great attempt, but suboptimal. I'm aiming with my generator for something quite a bit more complex than what they do -- fully asynchronous and dependency-based pool imports with dataset discovery and the like. From cache file, to list of devices to wait, to devices waited upon, to pools imported, to datasets from these pools mounted, properly interspersed with filesystems from /etc/fstab. The tricky part is obtaining the dataset list before importing the pools (which is a requisite to be able to generate units for the datasets in a systemd generator)

behlendorf · 2013-08-08T23:48:09Z

The tricky part is obtaining the dataset list before importing the pools (which is a requisite to be able to generate units for the datasets in a systemd generator)

zdb could be extended to extract what you need from the pool without importing it. For example just getting the dataset names from an exported pool can be done like this. Extracting properties from those dataset would take a bit more work but is entirely doable.

zdb -ed <pool>

Rudd-O · 2013-09-03T10:33:58Z

Yah, that's a piece of cake. The problem is that the devices you need to query this information, simply aren't available when the generator runs. Get it? At the time the generator runs, when I need to enumerate datasets for proper ordering on mount, that information is not available, because the devices containing the information, "don't exist" yet in /dev.

Also, by the way, zdb -C doesn't show cache devices. That's a problem.

ryao · 2013-09-03T22:52:43Z

@Rudd-O Would you file a separate issue for that?

Rudd-O · 2013-09-18T01:18:52Z

Yes. #1733

behlendorf · 2014-04-16T17:41:11Z

Support for systemd has been merged in to master, see 881f45c

This adds an interface to "punch holes" (deallocate space) in VFS files. The interface is identical to the Solaris VOP_SPACE interface. This interface is necessary for TRIM support on file vdevs. This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which was introduced in 2.6.38. For a brief time before 2.6.38 this was done using the truncate_range inode operation, which was quickly deprecated. This patch only supports FALLOC_FL_PUNCH_HOLE. This adds support for the truncate_range() inode operation to VOP_SPACE() for file hole punching. This API is deprecated and removed in 3.5, so it's only useful for old kernels. On tmpfs, the truncate_range() inode operation translates to shmem_truncate_range(). Unfortunately, this function expects the end offset to be inclusive and aligned to the end of a page. If it is not, the kernel will stop with a BUG_ON(). This patch fixes the issue by adapting to the constraints set forth by shmem_truncate_range(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#168

All old pools will need to be destroyed, as the old on-disk format is no longer supported. The new DataObject has 3 sections: 1. size word: encoded length of DataObjectPhys (plus pad bits for future use, e.g. verioning) 2. bincode-encoded DataObjectPhys, which contains BlockId's and block sizes, not block contents. Entries are sorted by BlockId. 3. block contents, in order specified by DataObjectPhys. By keeping this separate from the DataObjectPhys, we can do a sub-object Get of just the first 2 sections and decode them on their own. The DataObjectPhys has the BlockId's and block sizes encoded as byte arrays, so that they can be decoded by serde in constant time. Code is added to access/byteswap each entry as needed. A new put_object_stream() method is added which takes a ByteStream, allowing us to put a DataObject to S3 without copying each of its blocks into a contiguous buffer. When reading a single block, a fast path is added to avoid constructing the entire DataObject; only O(log(n)) entries need to be accessed/byteswapped to binary search for the target block. In the future this can be further enhanced to do a sub-object Get of just the DataObjectPhys (or find the offset in the ZettaCache), followed by a sub-object Get of just the target block. bonus changes: * rename ingest_all() -> insert_all() to match insert() * change some trace! to super_trace!

Rudd-O closed this as completed Jul 6, 2011

Rudd-O reopened this Jul 6, 2011

behlendorf closed this as completed Apr 16, 2014

sdimitro pushed a commit to sdimitro/zfs that referenced this issue May 21, 2020

DLPX-69914 disable automatic scrubs (openzfs#168)

d89cb18

jittygitty mentioned this issue Apr 30, 2022

FAST-Tracking REFLINK and Offline Deduplication, first for LINUX only #13349

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

viable strategy to support systemd #168

viable strategy to support systemd #168

Rudd-O commented Mar 21, 2011

behlendorf commented Mar 21, 2011

Rudd-O commented Jul 6, 2011

Rudd-O commented Jul 6, 2011

Rudd-O commented Jul 6, 2011

Rudd-O commented Nov 12, 2011

Rudd-O commented Nov 12, 2011

ryao commented Jul 14, 2013

Rudd-O commented Jul 22, 2013

behlendorf commented Aug 8, 2013

Rudd-O commented Sep 3, 2013

ryao commented Sep 3, 2013

Rudd-O commented Sep 18, 2013

behlendorf commented Apr 16, 2014

viable strategy to support systemd #168

viable strategy to support systemd #168

Comments

Rudd-O commented Mar 21, 2011

behlendorf commented Mar 21, 2011

Rudd-O commented Jul 6, 2011

Rudd-O commented Jul 6, 2011

Rudd-O commented Jul 6, 2011

Rudd-O commented Nov 12, 2011

Rudd-O commented Nov 12, 2011

ryao commented Jul 14, 2013

Rudd-O commented Jul 22, 2013

behlendorf commented Aug 8, 2013

Rudd-O commented Sep 3, 2013

ryao commented Sep 3, 2013

Rudd-O commented Sep 18, 2013

behlendorf commented Apr 16, 2014