-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove the device and inode numbers from the API. #81
Conversation
q: If this merges, would the behavior change on wasi_snapshot_preview1 in wasi-sdk 17 that triggered the discussion go back to before? |
What we might do is change wasi-libc to remove st_dev and st_ino from its struct stat, even though it's still using preview1 which provides that info. And maybe add |
but i'm not sure about for porting existing apps, i guess a |
A file id containing a (dev_t, ino_t) would still have the global device number space problem, would still be inefficient and/or awkward to implement on platforms that don't have inodes (eg. mentioned here), and it would still have the information leak problem. Another problem that I don't remember having mentioned before is that Windows' ReFS requires 128 bits for a full file identifier. Another is the the ID for a file can change over time and if we return the ID as integer values, those values could become stale. This is a big change, to be sure. I didn't initially think this would be a good idea and I still have reservations. I don't know for sure that we can do this without hitting some compatibility constraint that we won't be willing to break. But if we can manage to make this change, it'll make the platform significantly simpler, and make it easier to implement WASI in more interesting ways. |
i think standards allow rename and hard links among filesystems.
that's why i mentioned the "this file doesn't have id" case.
we can make "wasi fileid" big enough.
do you mean the notes on FAT in the page?
i feel a bit ambivalent too. |
@jerbob92 since you are an ultimate end user here, any thoughts on above? |
It's a good point. And since it wasn't that useful anyway, I've now just removed While thinking about this, I also renamed
Ah, right. So we'd only have two of those three problems :-).
That is true; 128 bits for an
Yes; I should have linked to |
another aside: wazero team are starting to try to pass the wasi-testsuite, which would be maybe affected by any decision here. Since the test filesystem is a real FS, we can leak the real inodes to pass it. If we get to a point where wasm is rather expensive on account of inodes we may need a flag to allow folks to opt into "extended stat" or something while this topic settles. $ wazero run -hostlogging filesystem -mount=/Users/adrian/oss/wasi-testsuite/tests/c/testsuite:/ fdopendir-with-access.wasm
==> wasi_snapshot_preview1.fd_prestat_get(fd=3)
<== (prestat={pr_name_len=1},errno=ESUCCESS)
==> wasi_snapshot_preview1.fd_prestat_dir_name(fd=3)
<== (path=/,errno=ESUCCESS)
==> wasi_snapshot_preview1.fd_prestat_get(fd=4)
<== (prestat=,errno=EBADF)
==> wasi_snapshot_preview1.fd_fdstat_get(fd=3)
<== (stat={filetype=DIRECTORY,fdflags=,fs_rights_base=,fs_rights_inheriting=},errno=ESUCCESS)
==> wasi_snapshot_preview1.path_open(fd=3,dirflags=SYMLINK_FOLLOW,path=fs-tests.dir/fopendir.dir,oflags=DIRECTORY,fs_rights_base=,fs_rights_inheriting=,fdflags=)
<== (opened_fd=4,errno=ESUCCESS)
==> wasi_snapshot_preview1.fd_readdir(fd=4,buf=70048,buf_len=4096,cookie=0,result.bufused=70028)
<== errno=ESUCCESS
==> wasi_snapshot_preview1.path_filestat_get(fd=4,flags=,path=file-1)
<== (filestat={filetype=REGULAR_FILE,size=0,mtim=1671602425719797759},errno=ESUCCESS)
==> wasi_snapshot_preview1.path_filestat_get(fd=4,flags=,path=file-0)
<== (filestat={filetype=REGULAR_FILE,size=0,mtim=1671602425719712300},errno=ESUCCESS)
==> wasi_snapshot_preview1.fd_close(fd=4)
<== errno=ESUCCESS
Assertion failed: inodes[0] != inodes[1] (testsuite/fdopendir-with-access.c: main: 47)
error instantiating wasm binary: module[] function[_start] failed: wasm error: unreachable
wasm stack trace:
.abort()
.__assert_fail(i32,i32,i32,i32)
.main(i32,i32) i32
.__main_void() i32 |
a4d79c8
to
54ade91
Compare
Another data point: Linux' overlayfs can exhibit changing st_ino and also st_dev for the same filesystem object. You can mitigate this somewhat by mounting with xino=on but that has a (small) performance cost and can also not 100% guarantee stable inodes. edit: That is to say, because overlayfs might be used by wasi implementations (it is a very popular option for fs virtualization), this is perhaps an argument to not expose these ids to WASI. |
From Mastodon, this reply points out that there are applications which use the inode number to supplement their "has this file changed" checks:
To address this use case, I've now added a |
And as yet another data point, Btrfs as mounted over NFS can reuse the same device,inode pairs. Each Btrfs subvolume independently handles inode numbering. While it locally presents a different device number for each subvolume, when mounted over NFS they all get the same device number, thus resulting in potential inode duplication. It is also not allowed to rename across subvolumes. https://lwn.net/Articles/866582/ Has removing hardlinks entirely been considered? Or is that too much of a break with POSIX? |
I do think that there are times where we just have to say to users "if you care about what happens, consider not using NFS" rather than trying to have WASI or WASI applications try to superheroically fix it.
I don't recall anyone having proposed this yet. I don't know that we need to prohibit it, but it would be good to at least document that one generally shouldn't depend on it succeeding. I've filed #83 to track this. |
it's a bit moot as FAT doesn't have hardlinks. |
wasi-filesystem.wit.md
Outdated
## `metadata-hash` | ||
```wit | ||
/// Return a hash of the metadata associated with a filesystem object referred | ||
/// to by a descriptor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my honest impression is that it belongs to applications.
with the current description, i suspect an application will need to combine it with other properties like timestamp etc by itself anyway.
i feel it's simpler to provide get-fileid-at
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would indeed be simpler, but it'd have the problems discussed above: not all desirable impls have a meaningful fileid, and an exposed fileid leaks host information. Also, a fileid by itself isn't unique, and is often expected to be paired with a deviceid, but that would require a deviceid namespace. The nice thing about metadata-hash is that it avoids assigning any meaning to the number, so it can be computed in different ways depending on what the host implementation has.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, a fileid by itself isn't unique, and is often expected to be paired with a deviceid, but that would require a deviceid namespace.
I was interpeting the proposal as fileid being a combination of the device id and inode number.
The nice thing about metadata-hash is that it avoids assigning any meaning to the number, so it can be computed in different ways depending on what the host implementation has.
A fileid could be a hash of everything that stays fixed for the lifetime of a file too, or assigned through any other consistent scheme.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the fileid is the device id and inode number, then how would it avoid the problems at the top of this issue?
If the fileid is a hash of everything that stays fixed for the lifetime of a file, how is that different from the metadata-hash
I just proposed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the fileid is the device id and inode number, then how would it avoid the problems at the top of this issue?
Hashing them together combined with a secret value specified by the wasi runtime that is unknown to the wasm module (such that brute-forcing a map from hash to input is not possible) would avoid metadata leakage. As for some filesystems not having inodes, that is just as much an issue with metadata-hash as all other metadata can be non-unique between two different files. Also in both cases it would be possible to have the wasi runtime assign the id/hash based on eg the file path.
If the fileid is a hash of everything that stays fixed for the lifetime of a file, how is that different from the metadata-hash I just proposed?
The metadata hash you specified includes the mtime, which can change without changing the identity of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've now added documentation mentioning that a secret value known only to the implementation may be used.
52a1297
to
6599875
Compare
I've now rebased this, including updating it to the new proposal repo format. I believe I've addressed all the feedback above, and that there were no show-stopping concerns. This PR is a tradeoff, and some things will be harder, but I think overall it's the right approach. I'll leave this open a little longer here in case anyone has any further feedback. |
Style nit: |
Sure, I've now added a patch for that. |
Is there any way to know how this is likely to be used in wasi-libc? For example, the root thing was trying to avoid a fan-out of stat, and how this doesn't turn into a fan-out of hash based metadata fetching. It is both hard to process the impact from ABI (mentally compile this into the canonical one) and also integration, except folks who know exactly how these signatures will end up and any change in the cardinality of host callbacks. So, if you've already mentally compiled this, can you please share? |
I expect wasi-libc will remove its
I haven't mentally compiled an ABI for this. When doing API design with Wit, I find it most useful to just think in terms of Wit, not in terms of the various ABIs that may be generated from it. |
As discussed [here], remove the fields which correspond to `st_dev`, `st_ino`, and `d_ino` in POSIX from the stat and directory entry structs. - Device numbers assume the existence of a global device number space, which creates implicit relationships between otherwise unrelated components. - Not all filesystem implementations have these numbers. And some that do have these numbers require extra implementation cost to retrieve them. - These numbers leak potentially sensitive or identifying information from the underlying filesystem implementation. In their place, provide some functions, `is-same-object`, `metadata-hash`, and `metadata-hash-at`, for explicitly testing whether two handles are the same file or have the same metadata, respectively. This doesn't cover all possible use cases for device and inode numbers, but we can add more functions as need arises. [here]: #65 (comment)
As discussed [here], remove the fields which correspond to `st_dev`, `st_ino`, and `d_ino` in POSIX from the stat and directory entry structs. - Device numbers assume the existence of a global device number space, which creates implicit relationships between otherwise unrelated components. - Not all filesystem implementations have these numbers. And some that do have these numbers require extra implementation cost to retrieve them. - These numbers leak potentially sensitive or identifying information from the underlying filesystem implementation. In their place, provide some functions, `is-same-object`, `metadata-hash`, and `metadata-hash-at`, for explicitly testing whether two handles are the same file or have the same metadata, respectively. This doesn't cover all possible use cases for device and inode numbers, but we can add more functions as need arises. [here]: #65 (comment)
72ace35
to
1728295
Compare
1728295
to
58e1ea5
Compare
Since there seem to be no major outstanding objections, I think it makes sense to merge this. Wasi-filesystem is still in Phase 2, so there's still time to make changes if any concerns come up; please file new issues! |
As discussed here, remove the fields which correspond to
st_dev
,st_ino
, andd_ino
in POSIX from the stat and directory entry structs.Device numbers assume the existence of a global device number space, which creates implicit relationships between otherwise unrelated components.
Not all filesystem implementations have these numbers. And some that do have these numbers require extra implementation cost to retrieve them.
These numbers leak potentially sensitive or identifying information from the underlying filesystem implementation.
In their place, provide two functions,
is-same-file
andis-same-mountpoint
, for explicitly testing whether two handles are the same file or are on the same mountpoint, respectively. This doesn't cover all possible use cases for device and inode numbers, but we can add more functions as need arises.