viable strategy to support systemd #168
Comments
Perhaps related to this is the .zfs snapshot directory. I held off implementing this in the first zfs release for a couple of reasons, the most important of which is that the way it was done under OpenSolaris is complicated. Under Linux I have a feeling the right thing to do is leverage the automounter. This still needs to be investigated, but if you're digging into the automounter for systemd you might also consider .zfs snapshots.
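For illustration, here is roughly what automounter-backed .zfs snapshots would look like from the shell; the pool, dataset, and snapshot names (tank/home, @monday) are made up:

```sh
# Nothing under .zfs/snapshot is mounted up front; the automounter would
# mount a snapshot the first time it is traversed and expire it when idle.
zfs snapshot tank/home@monday
ls /tank/home/.zfs/snapshot/monday    # first access triggers the mount
mount -t zfs | grep monday            # the snapshot now shows up as mounted
```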
Unfortunately I have not been able to come up with a credible solution for this issue yet. I would also like to add that we do not really have a credible poweroff story either, since I suspect the same problem afflicts the initscript. WORSE STILL, in the case of zfs-as-rootfs, the pool should ideally be exported or put in a safe state before powering off, but I don't believe we do that right now, and I don't know how to export the pool without removing access to the binaries required to finish the poweroff.
oh shi- I closed the bug by accident. ANYWAY. systemd automatically reads /etc/fstab at boot to create the proper mount and automount units for everything in it, so either we extend systemd to do the same for zfs, or we manually write unit files and try to keep that mess in sync (honestly, very difficult to do). I would personally choose the first avenue.
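For the hand-written avenue, a unit for a ZFS dataset might look like the sketch below. Everything here is illustrative: the dataset tank/var and the zfs-import.service dependency are assumptions, not shipped names. Note that systemd requires the unit file name to mirror the mount path, so /var becomes var.mount:

```sh
# Sketch only: write a mount unit for the hypothetical dataset tank/var.
cat > /etc/systemd/system/var.mount <<'EOF'
[Unit]
Description=Mount ZFS dataset tank/var
Requires=zfs-import.service
After=zfs-import.service

[Mount]
What=tank/var
Where=/var
Type=zfs
EOF

systemctl daemon-reload     # make systemd re-read its unit files
systemctl start var.mount   # mount it now; add WantedBy for boot-time
```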
Of course, systemd's idea of which filesystems are available should be updated every time a pool is imported or exported, and every time a filesystem is created or removed.
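A minimal sketch of that synchronization, assuming a hypothetical regenerate-zfs-units helper that rewrites one .mount unit per dataset:

```sh
# After any event that changes the set of mountable datasets, rewrite the
# units and tell systemd to pick up the new state.
zpool import tank        # or: zpool export / zfs create / zfs destroy ...
regenerate-zfs-units     # hypothetical: emits one .mount unit per dataset
systemctl daemon-reload  # systemd re-reads its unit directories
```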
There also needs to be a fix for the following bug: mount -o remount,rw / must be supported; an strace showed it failing.
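Concretely, the operation that has to work during early boot is the read-write remount of the root filesystem; with a ZFS root there is also a property-based near-equivalent (the dataset name is made up):

```sh
mount -o remount,rw /             # what systemd/rc.sysinit issues after fsck
zfs set readonly=off rpool/ROOT   # ZFS-native toggle; rpool/ROOT is assumed
```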
The TODO items in the pull request spell out what we need in order to have full systemd support.
Gentoo Linux now has systemd support in its ZFS ebuilds. The Sabayon Linux developers did the prerequisite work for this. https://bugs.gentoo.org/show_bug.cgi?id=475872 @Rudd-O I know that you are a fan of systemd support. There is an opportunity here to do some work to get this merged into ZFSOnLinux. I do not plan to do this myself because I do not use systemd.
Replied. Their systemd work is a great attempt, but suboptimal. With my generator I'm aiming for something quite a bit more complex than what they do -- fully asynchronous, dependency-based pool imports with dataset discovery and the like: from the cache file, to the list of devices to wait for, to the devices waited upon, to the pools imported, to the datasets from those pools mounted, properly interleaved with the filesystems from /etc/fstab. The tricky part is obtaining the dataset list before importing the pools, which is a prerequisite for generating units for the datasets in a systemd generator.
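The flow above, as a generator sketch. This is a sketch under stated assumptions, not the shipped implementation: the path and unit names are invented, the cache-file parsing is a guess at zdb's output format, device dependencies are elided, and the chicken-and-egg problem from the last sentence is exactly why the final step cannot be completed here:

```sh
#!/bin/sh
# Hypothetical /usr/lib/systemd/system-generators/zfs-generator sketch.
gendir="$1"                        # systemd passes the output directory
cachefile=/etc/zfs/zpool.cache
[ -r "$cachefile" ] || exit 0

# 1. Cache file -> pool names (zdb -C dumps the cached config; the sed
#    pattern for top-level pool entries is an assumption about its output).
for pool in $(zdb -C -U "$cachefile" | sed -n 's/^\([^ ]*\):$/\1/p'); do
    # 2. One import service per pool. A real generator would also order it
    #    After=/BindsTo= the pool's member .device units, so the import
    #    fires only once the devices have actually appeared in /dev.
    cat > "$gendir/zfs-import-$pool.service" <<EOF
[Unit]
DefaultDependencies=no
Before=local-fs-pre.target

[Service]
Type=oneshot
ExecStart=/sbin/zpool import -N $pool
EOF
done

# 3. Dataset -> .mount units would go here, interleaved with fstab entries,
#    but 'zfs list' needs an imported pool: the chicken-and-egg noted above.
```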
|
Yah, that's a piece of cake. The problem is that the devices you need in order to query this information simply aren't available when the generator runs. Get it? At the time the generator runs, when I need to enumerate datasets for proper mount ordering, that information is not available, because the devices containing it "don't exist" yet in /dev. Also, by the way, zdb -C doesn't show cache devices. That's a problem.
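To make the constraint concrete: at generator time the only queryable state is the cache file itself, and it has known blind spots (the pool name tank is made up):

```sh
zdb -C -U /etc/zfs/zpool.cache        # dump the cached config of every pool
zdb -C -U /etc/zfs/zpool.cache tank   # or a single pool's config
# Two gaps: L2ARC cache devices are absent from this output (the complaint
# above), and datasets are not recorded in the cache file at all, so mount
# units cannot be derived from it.
```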
@Rudd-O Would you file a separate issue for that? |
Yes. #1733 |
Support for systemd has been merged into master, see 881f45c
systemd no longer mounts things by going through /etc/fstab at boot -- rather, it implicitly creates mount (and automount) units from it, and the automount units use the kernel automounter to mount a filesystem when it is first accessed.
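This is the behavior we could piggyback on. For a legacy-mounted dataset, a single fstab line (dataset name assumed) already gets systemd to generate both a mount unit and an automount unit:

```sh
# /etc/fstab -- x-systemd.automount makes systemd create var.automount,
# so /var is mounted by the kernel automounter on first access.
# (Requires the dataset to use mountpoint=legacy.)
tank/var  /var  zfs  noauto,x-systemd.automount  0 0
```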
I have been thinking about this, and perhaps the best strategy would be to create and remove systemd filesystem units as filesystems are discovered, created, or removed, so that the units are already on disk at boot and the filesystems get mounted.
The bigger advantage of doing this, in addition to parallel mounting of filesystems on demand, is that we can now interleave different types of filesystems and they will be correctly mounted on boot. This case, for example, does not work with our current initscript:
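The kind of layout meant here (exact entries assumed for illustration):

```sh
# A mixed layout that trips up the initscript ordering:
#   /         regular filesystem, in /etc/fstab (mounted by rc.sysinit)
#   /var      ZFS dataset, e.g. tank/var (mounted later by the zfs initscript)
#   /var/lib  regular filesystem, in /etc/fstab
```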
As you can see, / would be mounted by rc.sysinit just fine, but then rc.sysinit would fail to mount /var/lib, because that mountpoint does not exist: the /var zfs filesystem has not been mounted, and will not be mounted until S01zfs runs later in the boot sequence. Pretty bad. But with systemd, if we get the units right, we can stop relying on zfs mount -a (which would not work with interleaved filesystem types anyway) and start relying on the kernel automounter to do the work for us.
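With per-path units, systemd derives the ordering itself: a mount unit is ordered after the mount units of its parent directories, whatever filesystem type each one is. Assuming the hypothetical units sketched earlier:

```sh
# var-lib.mount is automatically ordered after var.mount because /var is a
# parent path of /var/lib; no global 'zfs mount -a' pass is involved.
systemctl list-dependencies --after var-lib.mount   # lists var.mount
```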