DmaFile returns EOPNOTSUP when reading (error code 95) #605

Open
vlovich opened this issue Sep 23, 2023 · 15 comments

Comments


vlovich commented Sep 23, 2023

I'm sure I must be doing something wrong, but I've been staring at this for a few days and can't figure out what it might be. I'm consistently failing to read from DmaFile for some reason:

thread 'tests::basic' panicked at 'Failed to read what we just wrote: EnhancedIoError { source: Os { code: 95, kind: Uncategorized, message: "Operation not supported" }, op: "Reading", path: Some("./myfile2CyhlQ.test"), fd: Some(8) }', lib.rs:90:18
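For reference, raw OS error 95 on Linux is `EOPNOTSUPP` ("Operation not supported"), the code wrapped by the `EnhancedIoError` above. A quick std-only sketch to confirm the mapping, independent of glommio:

```rust
use std::io::Error;

fn main() {
    // Raw OS error 95 on Linux is EOPNOTSUPP; this is the same code that
    // glommio's EnhancedIoError reports in the panic above.
    let err = Error::from_raw_os_error(95);
    println!("os error 95 = {err}");
    assert_eq!(err.raw_os_error(), Some(95));
}
```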

Here's the test:

    #[test]
    fn basic() {
        let normal_tmpfile = tempfile::Builder::new()
            .prefix("myfile")
            .suffix(".test")
            .tempfile_in(std::env::current_dir().unwrap())
            .unwrap();
        eprintln!("Running on {:?}", normal_tmpfile.path());

        let ex = glommio::LocalExecutorBuilder::new(glommio::Placement::Fixed(0))
            .make()
            .unwrap();

        ex.run(async move {
            let file = glommio::io::OpenOptions::new()
                .read(true)
                .write(true)
                .create(true)
                .dma_open(normal_tmpfile.path())
                .await
                .unwrap();

            let initial_read = file
                .read_at_aligned(0, 4096)
                .await
                .expect("Failed to read from empty file");
            assert_eq!(initial_read.len(), 0);

            let mut buf = file.alloc_dma_buffer(4096);
            eprintln!("Allocated {} bytes", buf.len());
            buf.as_bytes_mut().fill(0);
            eprintln!("Writing [0, {}]", buf.len());
            file.write_at(buf, 0)
                .await
                .expect("Failed to write [0, 4096)");

            eprintln!("Reading [0, 4096)");
            file.read_at(0, 4096)
                .await
                .expect("Failed to read what we just wrote");
        });
    }

Tested configurations:

Kernels tested:
6.5.4-1-MANJARO
6.5.4-2-ck
6.1.55-1-MANJARO

I've also tried pinning the test to P cores (13900K) in case that was some kind of issue.

| Filesystem | Logical Block Size | Physical Block Size | Test Passes |
| --- | --- | --- | --- |
| tmpfs | ? | ? | [x] |
| btrfs | 4096 | 4096 | [x] |
| ext4 | 4096 | 4096 | [ ] |
| xfs | 4096 | 4096 | [ ] |
| xfs | 512 | 512 | [ ] |
| f2fs | 4096 | 4096 | [ ] |
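For anyone wanting to reproduce the table, the block sizes can be read straight from sysfs. A minimal sketch (assumes Linux; returns an empty list where `/sys/block` is unavailable, e.g. in some containers):

```rust
use std::fs;

// Read (device, logical_block_size, physical_block_size) for every block
// device the kernel exposes under /sys/block.
fn block_sizes() -> Vec<(String, String, String)> {
    let mut out = Vec::new();
    if let Ok(entries) = fs::read_dir("/sys/block") {
        for entry in entries.flatten() {
            let queue = entry.path().join("queue");
            let read = |name: &str| {
                fs::read_to_string(queue.join(name))
                    .map(|s| s.trim().to_string())
                    .unwrap_or_else(|_| "?".into())
            };
            out.push((
                entry.file_name().to_string_lossy().into_owned(),
                read("logical_block_size"),
                read("physical_block_size"),
            ));
        }
    }
    out
}

fn main() {
    for (dev, logical, physical) in block_sizes() {
        println!("{dev}: logical={logical} physical={physical}");
    }
}
```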

Will try a different kernel version & update shortly if this turns out to be some kind of kernel bug with io_uring + O_DIRECT.


glommer commented Sep 23, 2023

You mean that this passes with tmpfs and btrfs? Or did you mean to imply the opposite by mistake?
I'm surprised this passes with tmpfs, tmpfs doesn't support O_DIRECT.


vlovich commented Sep 23, 2023

Correct. It passes with tmpfs and btrfs, and the read fails everywhere else. (Writes are fine; I've verified, using hexdump and by breaking before we attempt the read, that the file is materialized correctly.)


glommer commented Sep 23, 2023

Ok. We manually disable direct I/O on tmpfs, so that case is really just the file falling back to non-O_DIRECT. The code is:

        let pollable = if (fstype.0 as u64) == (libc::TMPFS_MAGIC as u64) {
            PollableStatus::NonPollable(DirectIo::Disabled)
        }

I have no experience with BTRFS, so I don't really know what they support.

I am very puzzled about the others; this seems like a run-of-the-mill write+read pattern with the most conservative buffer sizes possible.


vlovich commented Sep 23, 2023

Are there any debugging tips you can share? I haven't figured out how to determine why the kernel is responding this way.


glommer commented Sep 23, 2023

I'd try to write the equivalent in C and take glommio out of the picture. That will tell you right away whether this is a kernel issue.
Assuming it is not a kernel issue (always the last thing I'd blame), I'd add code deep in the call stack, as close as possible to the underlying io_uring SQE preparation, to make sure nothing is getting corrupted along the way.
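The filesystem's O_DIRECT behaviour can also be isolated from both glommio and io_uring with a plain blocking round trip. A std-only sketch (assumptions: the `O_DIRECT` constant is the x86-64 value — real code would use `libc::O_DIRECT` — and the file name is just an example):

```rust
use std::fs::OpenOptions;
use std::io::{Read, Seek, SeekFrom, Write};
use std::os::unix::fs::OpenOptionsExt;

// x86-64 value of O_DIRECT; use libc::O_DIRECT in real code.
const O_DIRECT: i32 = 0x4000;

// O_DIRECT generally requires an aligned buffer; repr(align) guarantees it.
#[repr(align(4096))]
struct Aligned([u8; 4096]);

fn round_trip(path: &str) -> std::io::Result<()> {
    let mut buf = Box::new(Aligned([0xAB; 4096]));
    let mut f = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .custom_flags(O_DIRECT)
        .open(path)?;
    f.write_all(&buf.0)?;
    f.seek(SeekFrom::Start(0))?;
    buf.0.fill(0);
    f.read_exact(&mut buf.0)?;
    assert!(buf.0.iter().all(|&b| b == 0xAB));
    Ok(())
}

fn main() {
    // On tmpfs this typically fails at open time since tmpfs rejects
    // O_DIRECT; on ext4/xfs it should succeed.
    match round_trip("dio_probe.bin") {
        Ok(()) => println!("O_DIRECT round trip ok"),
        Err(e) => println!("O_DIRECT round trip failed: {e}"),
    }
    let _ = std::fs::remove_file("dio_probe.bin");
}
```

If this blocking version passes on a filesystem where the glommio test fails, the problem is narrowed to the io_uring submission path rather than O_DIRECT itself.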


vlovich commented Sep 23, 2023

You don't happen to have a basic C example handy, do you? Is https://unixism.net/2020/04/io-uring-by-example-part-1-introduction/ a good starting point?


glommer commented Sep 24, 2023

Thinking again, there's an even more basic thing we can do: run the unit-test suite from liburing. Reading those tests is also a great way to see a lot of examples in action.


vlovich commented Sep 24, 2023

Haven't tried the repro yet, but one thing I've narrowed down: this only happens when the file is newly created and then written to; after that, reads start failing. If I touch the file first, read-after-write works fine.


vlovich commented Sep 25, 2023

Tracking it down further, it has to be a write of 4096 at position 0 followed by an equivalent read. If I write 8192, for example, the issue doesn't repro. So I think this might be something subtle within the io_uring layer.

NVM. I must have been running the test code incorrectly. Read-after-write fails regardless of the positions/lengths, as long as the file didn't exist prior to opening.

@glommer this doesn't repro for you?


vlovich commented Sep 25, 2023

Here are some test cases of an initial write followed by a read. No pattern is immediately discernible to me. The last two test cases in particular seem fairly weird...

| write pos | write len | read pos | read len | passed |
| --- | --- | --- | --- | --- |
| 0 | 4096 | 0 | 4096 | N |
| 0 | 8192 | 0 | 4096 | N |
| 0 | 8192 | 4096 | 4096 | N |
| 4096 | 8192 | 0 | 4096 | Y |
| 4096 | 8192 | 4096 | 4096 | N |
| 4096 | 8192 | 4096 | 8192 | N |
| 4096 | 8192 | 8192 | 4096 | N |
| 4096 | 4096 | 0 | 4096 | Y |
| 4096 | 4096 | 4096 | 4096 | N |
| 12288 | 4096 | 0 | 4096 | Y |
| 12288 | 4096 | 0 | 8192 | Y |
| 12288 | 4096 | 0 | 12288 | Y |
| 12288 | 4096 | 0 | 16384 | Y |
| 12288 | 4096 | 4096 | 12288 | Y |
| 12288 | 4096 | 8192 | 12288 | Y |
| 12288 | 4096 | 4096 | 4096 | Y |
| 12288 | 4096 | 8192 | 4096 | Y |
| 12288 | 4096 | 4096 | 8192 | Y |
| 12288 | 4096 | 8192 | 8192 | Y |
| 12288 | 4096 | 12288 | 4096 | N |
| 12288 | 4096 | 4096 | 12288 | Y |


glommer commented Sep 26, 2023

I just tried this on my old box, and I can't reproduce on either xfs or ext4.
My kernel is pretty old; I haven't updated it in a while, so it's still 5.19.


vlovich commented Sep 26, 2023

https://gist.github.com/vlovich/9ab1bbb9b7c6081365c49b2ce35bf713 works. Not sure if that maps 1:1 with what's happening in glommio, but I have to imagine it's pretty close. @glommer does the C code look correct with respect to what glommio does internally?

The relevant nuances for repro may be:

  1. The NVMe drive is formatted with a native 4096-byte LBA (LBA Format 1 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)). That being said, this repros on a device I have that only exports a 512-byte size, so it's probably not this.
  2. Block size for partition is 4096 (across all filesystems tested except obviously tmpfs)
  3. Using LVM on top of LUKS.

All this being said, I can't imagine how any of the details matter here. @glommer any tips on what to start debugging within glommio?


glommer commented Sep 26, 2023

Take a look at src/sys/uring.rs and inspect things right before they are put into their SQEs.
For example, this is how the SQE gets prepped for a read:

                            sqe.prep_read(op.fd, buf.as_bytes_mut(), pos);

I am starting to think that maybe the buffers are not aligned somehow, and that's why we're getting this error.
We should be aligning buffers, and it's hard to believe we'd get ENOTSUPP rather than some other error code for that, but maybe?
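One way to test that hypothesis: O_DIRECT generally requires the buffer address, the file offset, and the length to all be multiples of the device's logical block size. A small helper (hypothetical, not glommio API) that could be asserted against the values passed to `prep_read`:

```rust
// O_DIRECT generally requires buffer address, file offset, and length to
// all be multiples of the logical block size (512 or 4096 here). This is a
// hypothetical debugging helper, not part of glommio.
fn dio_aligned(ptr: *const u8, offset: u64, len: usize, block_size: u64) -> bool {
    (ptr as u64) % block_size == 0
        && offset % block_size == 0
        && (len as u64) % block_size == 0
}

fn main() {
    #[repr(align(4096))]
    struct Aligned([u8; 4096]);
    let buf = Aligned([0; 4096]);

    assert!(dio_aligned(buf.0.as_ptr(), 0, 4096, 4096));
    // A misaligned offset or length is what an alignment bug would look like.
    assert!(!dio_aligned(buf.0.as_ptr(), 512, 4096, 4096));
    println!("alignment checks pass");
}
```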

If you can - I'd try a similar thing without LUKS, and without LVM (between the two, I'd suspect LUKS more than LVM).
It's the most likely thing in that list to actually be causing an issue.


vlovich commented Sep 28, 2023

I printed all the buffers used by sqe.prep_read and the pointer is always 4096 aligned.


vlovich commented Sep 28, 2023

Instrumented the kernel to get more insight. The line that's returning the error is this one:
https://elixir.bootlin.com/linux/v6.5.5/source/block/blk-core.c#L732

I did disable the read/write workqueues as per https://blog.cloudflare.com/speeding-up-linux-disk-encryption/ and https://wiki.archlinux.org/title/Dm-crypt/Specialties#Disable_workqueue_for_increased_solid_state_drive_(SSD)_performance, which I'm assuming explains this. Not sure why btrfs is unaffected, as I think that partition also has them disabled.

Also unclear why it only happens with a newly created file followed by a write but otherwise works. Not sure why running a very simple io_uring C program directly worked...

  • EDIT: Even after re-enabling the workqueues the error persists, so I must be misunderstanding the exact failure path within the kernel.
