viable strategy to support systemd #168

Closed
Rudd-O opened this issue Mar 21, 2011 · 13 comments
Labels
Type: Feature (Feature request or new feature)

Comments

@Rudd-O
Contributor

Rudd-O commented Mar 21, 2011

systemd no longer mounts things by going through /etc/fstab at boot -- rather it implicitly creates filesystem units, which then use the kernel automounter to mount a filesystem when it is first accessed.

I have been thinking about this and perhaps the best strategy would be to create and remove systemd filesystem units as filesystems are discovered / created / removed, so they will be on disk and mounted.

The bigger advantage of doing this, in addition to on-demand parallel mounting of filesystems, is that we can now interleave different types of filesystems and they will be mounted correctly at boot. This case, for example, does not work with our current initscript:

/ ext4
/var zfs
/var/lib ext4

As you can see, / would be mounted by rc.sysinit just fine, but then rc.sysinit would fail to mount /var/lib, because that mountpoint does not exist: the /var zfs filesystem has not been mounted and will not be mounted until S01zfs start executes later in the boot sequence. Pretty bad. But with systemd, if we get the units right, we can stop relying on zfs mount -a (which would not work with interleaved filesystem types anyway) and start relying on the kernel automounter to do the work for us.
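
To make this concrete, the kind of unit pair I have in mind would look roughly like this. This is only a sketch: the pool/dataset name tank/var is hypothetical, and it assumes the dataset is set to mountpoint=legacy so that plain mount(8)/mount.zfs is allowed to mount it; the real mechanism would create and remove these files automatically.

# Sketch only -- hypothetical dataset tank/var, assumed mountpoint=legacy.
cat > /etc/systemd/system/var.mount <<'EOF'
[Unit]
Description=ZFS dataset for /var

[Mount]
What=tank/var
Where=/var
Type=zfs
EOF

cat > /etc/systemd/system/var.automount <<'EOF'
[Unit]
Description=Automount /var on first access

[Automount]
Where=/var

[Install]
WantedBy=local-fs.target
EOF

systemctl daemon-reload
systemctl enable var.automount

Because systemd adds implicit ordering between mount units whose mount points nest, the var-lib.mount that systemd generates from /etc/fstab would then be ordered after /var, which addresses the interleaving case above.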

@behlendorf
Contributor

Perhaps related to this is the .zfs snapshot directory. I held off implementing this in the first zfs release for a couple of reasons, the most important of which is that the way it was done under OpenSolaris is complicated. Under Linux I have a feeling the right thing to do is leverage the automounter. This still needs to be investigated, but if you're digging into the automounter for systemd you might also consider .zfs snapshots.

@Rudd-O
Contributor Author

Rudd-O commented Jul 6, 2011

Unfortunately I have not been able to come up with a credible solution for this issue yet.

I would also like to add that we do not really have a credible poweroff story, since I suspect the same problem afflicts the initscript. WORSE STILL, in the case of zfs-as-rootfs, the pool should ideally be exported or put in a safe state before powering off, but I don't believe we do that right now, and I dunno how to export the pool without removing access to the binaries required to finish the poweroff.

Rudd-O closed this as completed Jul 6, 2011
Rudd-O reopened this Jul 6, 2011
@Rudd-O
Contributor Author

Rudd-O commented Jul 6, 2011

oh shi- I closed the bug by accident.

ANYWAY. Systemd automatically reads /etc/fstab to create the proper mount and automount units for everything upon boot, so either we extend systemd to do the same for zfs, or we manually write unit files and try to keep that mess in sync (honestly, very difficult to do).

I would personally choose the first avenue.
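
A third, lower-tech option that already works today is to flip datasets to legacy mountpoints and list them in /etc/fstab, letting systemd's existing fstab handling generate the units. Sketch, with a hypothetical dataset name:

zfs set mountpoint=legacy tank/var
echo 'tank/var  /var  zfs  defaults  0 0' >> /etc/fstab
systemctl daemon-reload

The obvious downside is that the mountpoint property then has to be managed by hand for every dataset, which is exactly the kind of mess I would rather avoid.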

@Rudd-O
Contributor Author

Rudd-O commented Jul 6, 2011

Of course, systemd's idea of what file systems are available should be updated every time a pool is imported or exported, and every time a filesystem is created / removed.
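
In other words, every one of those events would need to end with something like the following (the unit-regeneration step is a placeholder for whatever we end up writing; only daemon-reload is real):

zpool import tank          # or zpool export, zfs create, zfs destroy ...
regenerate-zfs-units       # hypothetical helper that rewrites the .mount/.automount files
systemctl daemon-reload    # make systemd re-read units and re-run generators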

@Rudd-O
Contributor Author

Rudd-O commented Nov 12, 2011

The following bug needs to be fixed (essentially, support mount -o remount,rw /; the strace below shows it failing):


~/Projects/Mine/zfs@karen.dragonfear α:
sudo strace -f -eexecve /bin/mount / -o remount
execve("/bin/mount", ["/bin/mount", "/", "-o", "remount"], [/* 21 vars */]) = 0
Process 1397 attached
Process 1396 suspended
[pid 1397] execve("/sbin/mount.zfs", ["/sbin/mount.zfs", "ssd/RPOOL/fedora-16", "/", "-o", "rw,remount,noatime,xattr"], [/* 17 vars */]) = 0
Process 1398 attached
Process 1397 suspended
[pid 1398] execve("/usr/sbin/exportfs", ["exportfs", "-v"], [/* 17 vars */]) = 0
Process 1397 resumed
Process 1398 detached
[pid 1397] --- {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1398, si_status=0, si_utime=0, si_stime=0} (Child exited) ---
filesystem 'ssd/RPOOL/fedora-16' cannot be mounted using 'mount'.
Use 'zfs set mountpoint=legacy' or 'zfs mount ssd/RPOOL/fedora-16'.
See zfs(8) for more information.
Process 1396 resumed
Process 1397 detached
--- {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1397, si_status=1, si_utime=0, si_stime=4} (Child exited) ---

~/Projects/Mine/zfs@karen.dragonfear α:
systemctl show remount-rootfs.service
Id=remount-rootfs.service
Names=remount-rootfs.service
Requires=systemd-stdout-syslog-bridge.socket
WantedBy=local-fs.target
Conflicts=shutdown.target
Before=fedora-readonly.service local-fs.target shutdown.target
After=systemd-readahead-collect.service systemd-readahead-replay.service fsck-root.service systemd-stdout-syslog-bridge.socket
Description=Remount Root FS
LoadState=loaded
ActiveState=failed
SubState=failed
FragmentPath=/lib/systemd/system/remount-rootfs.service
UnitFileState=static
InactiveExitTimestamp=Fri, 11 Nov 2011 05:54:04 -0800
InactiveExitTimestampMonotonic=4694290
ActiveEnterTimestampMonotonic=0
ActiveExitTimestampMonotonic=0
InactiveEnterTimestamp=Fri, 11 Nov 2011 05:54:04 -0800
InactiveEnterTimestampMonotonic=4993656
CanStart=yes
CanStop=yes
CanReload=no
CanIsolate=no
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=no
OnFailureIsolate=no
IgnoreOnIsolate=no
IgnoreOnSnapshot=no
DefaultControlGroup=name=systemd:/system/remount-rootfs.service
ControlGroup=cpu:/system/remount-rootfs.service name=systemd:/system/remount-rootfs.service
NeedDaemonReload=no
JobTimeoutUSec=0
ConditionTimestamp=Fri, 11 Nov 2011 05:54:04 -0800
ConditionTimestampMonotonic=4685445
ConditionResult=yes
Type=oneshot
Restart=no
NotifyAccess=none
RestartUSec=100ms
TimeoutUSec=1min 30s
ExecStart={ path=/bin/mount ; argv[]=/bin/mount / -o remount ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
UMask=0022
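
For completeness, the workaround the error message itself points at is to switch the root dataset to a legacy mountpoint and give it an /etc/fstab line, so that plain mount(8) (and therefore remount-rootfs.service) is allowed to handle it. Sketch, using the dataset name from the strace above; since the dataset is the running root, the property change realistically has to happen from the initramfs or another boot environment:

zfs set mountpoint=legacy ssd/RPOOL/fedora-16
echo 'ssd/RPOOL/fedora-16  /  zfs  defaults  0 0' >> /etc/fstab
mount / -o remount,rw

The real fix, of course, is for the remount to work without forcing legacy mountpoints.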


@Rudd-O
Contributor Author

Rudd-O commented Nov 12, 2011

The TODO items in the pull request list the things we need in order to have full systemd support.

@ryao
Contributor

ryao commented Jul 14, 2013

Gentoo Linux now has systemd support in its ZFS ebuilds. The Sabayon Linux developers did the prerequisite work for this.

https://bugs.gentoo.org/show_bug.cgi?id=475872

@Rudd-O I know that you are a fan of systemd support. There is an opportunity here to do some work to get this merged into ZFSOnLinux. I do not plan to do this myself because I do not use systemd.

@Rudd-O
Contributor Author

Rudd-O commented Jul 22, 2013

Replied. Their systemd work is a great attempt, but suboptimal. With my generator I'm aiming for something quite a bit more complex than what they do -- fully asynchronous and dependency-based pool imports with dataset discovery and the like: from cache file, to list of devices to wait for, to devices waited upon, to pools imported, to datasets from those pools mounted, properly interspersed with filesystems from /etc/fstab. The tricky part is obtaining the dataset list before importing the pools (which is a requisite to be able to generate units for the datasets in a systemd generator).
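
Roughly, the generator I have in mind is shaped like the sketch below. Everything in it is an assumption rather than an existing interface -- in particular the cache file of dataset/mountpoint pairs, which something at import/creation time would have to keep up to date, precisely because the pool itself cannot be read this early:

#!/bin/sh
# Sketch of a systemd generator (systemd runs these very early, passing the
# directory where generated units should be written as the first argument).
GENDIR="$1"
CACHE=/etc/zfs/dataset-mountpoints.cache   # hypothetical "dataset<TAB>mountpoint" list

[ -r "$CACHE" ] || exit 0

while IFS="$(printf '\t')" read -r dataset mountpoint; do
    # Crude path-to-unit-name escaping: /var/lib -> var-lib.mount.
    # Real code would use systemd-escape --path.
    unit=$(echo "${mountpoint#/}" | sed 's,/,-,g')
    [ -n "$unit" ] || unit='-'
    # zfsutil lets mount.zfs handle a non-legacy dataset.
    cat > "$GENDIR/$unit.mount" <<EOF
[Unit]
Before=local-fs.target

[Mount]
What=$dataset
Where=$mountpoint
Type=zfs
Options=zfsutil
EOF
done < "$CACHE"

The unsolved part is populating that cache file in the first place, which is exactly the "obtaining the dataset list before importing the pools" problem.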

@behlendorf
Contributor

The tricky part is obtaining the dataset list before importing the pools (which is a requisite to be able to generate units for the datasets in a systemd generator)

zdb could be extended to extract what you need from the pool without importing it. For example, just getting the dataset names from an exported pool can be done like this. Extracting properties from those datasets would take a bit more work but is entirely doable.

zdb -ed <pool>
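
As a rough illustration (the awk pattern assumes the "Dataset <name> [ZPL] ..." summary lines zdb currently prints, so treat the exact parsing as a sketch; "tank" is a hypothetical pool name):

zdb -ed tank | awk '$1 == "Dataset" && $3 ~ /ZPL/ { print $2 }'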

@Rudd-O
Contributor Author

Rudd-O commented Sep 3, 2013

Yah, that's a piece of cake. The problem is that the devices you need in order to query this information simply aren't available when the generator runs. Get it? At the time the generator runs, when I need to enumerate datasets for proper ordering on mount, that information is not available, because the devices containing the information "don't exist" yet in /dev.

Also, by the way, zdb -C doesn't show cache devices. That's a problem.

@ryao
Contributor

ryao commented Sep 3, 2013

@Rudd-O Would you file a separate issue for that?

@Rudd-O
Contributor Author

Rudd-O commented Sep 18, 2013

Yes. #1733

@behlendorf
Contributor

Support for systemd has been merged into master, see 881f45c

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Mar 1, 2015
This adds an interface to "punch holes" (deallocate space) in VFS
files. The interface is identical to the Solaris VOP_SPACE interface.
This interface is necessary for TRIM support on file vdevs.

This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which
was introduced in 2.6.38. For a brief time before 2.6.38 this was done
using the truncate_range inode operation, which was quickly deprecated.
This patch only supports FALLOC_FL_PUNCH_HOLE.

This adds support for the truncate_range() inode operation to
VOP_SPACE() for file hole punching. This API is deprecated and removed
in 3.5, so it's only useful for old kernels.

On tmpfs, the truncate_range() inode operation translates to
shmem_truncate_range(). Unfortunately, this function expects the end
offset to be inclusive and aligned to the end of a page. If it is not,
the kernel will stop with a BUG_ON().

This patch fixes the issue by adapting to the constraints set forth by
shmem_truncate_range().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#168
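
The effect this commit relies on can be seen from userspace with util-linux's fallocate, which issues the same FALLOC_FL_PUNCH_HOLE request; a quick sketch against a hypothetical file-vdev backing file:

# Punch a 1 MiB hole at the start of the (hypothetical) backing file.
fallocate --punch-hole --offset 0 --length 1M /tank-files/vdev0.img
ls -l /tank-files/vdev0.img   # apparent size is unchanged
du -h /tank-files/vdev0.img   # allocated space shrinks if those blocks were allocated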
sdimitro pushed a commit to sdimitro/zfs that referenced this issue May 21, 2020
sdimitro pushed a commit to sdimitro/zfs that referenced this issue Feb 14, 2022
All old pools will need to be destroyed, as the old on-disk format is no
longer supported.

The new DataObject has 3 sections:
1. size word: encoded length of DataObjectPhys (plus pad bits for future
   use, e.g. versioning)
2. bincode-encoded DataObjectPhys, which contains BlockId's and block
   sizes, not block contents.  Entries are sorted by BlockId.
3. block contents, in order specified by DataObjectPhys.  By keeping
   this separate from the DataObjectPhys, we can do a sub-object
   Get of just the first 2 sections and decode them on their own.

The DataObjectPhys has the BlockId's and block sizes encoded as byte
arrays, so that they can be decoded by serde in constant time.  Code is
added to access/byteswap each entry as needed.

A new put_object_stream() method is added which takes a ByteStream,
allowing us to put a DataObject to S3 without copying each of its blocks
into a contiguous buffer.

When reading a single block, a fast path is added to avoid constructing
the entire DataObject; only O(log(n)) entries need to be
accessed/byteswapped to binary search for the target block.  In the
future this can be further enhanced to do a sub-object Get of just the
DataObjectPhys (or find the offset in the ZettaCache), followed by a
sub-object Get of just the target block.

bonus changes:
* rename ingest_all() -> insert_all() to match insert()
* change some trace! to super_trace!