Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple Rust driver that touches real hardware: PR 2/6 #276

Merged
merged 3 commits into from
May 25, 2021

Conversation

TheSven73
Copy link
Collaborator

Simple Rust driver that touches real h/w: step 2 of 6

@alex and @ojeda suggested that #254 should be split into smaller PRs, so they can be reviewed and merged individually.

Roadmap:

  • platform_device
  • devicetree (of) matching (you are here)
  • probe
  • DrvData
  • regmap/iomem
  • finally the actual hwrng driver

/// # Invariants
///
/// The last array element is always filled with zeros (the default).
pub struct OfMatchTable(Box<[bindings::of_device_id; 2]>);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to support variable-size match tables by using const generics here, eg:

pub struct OfMatchTable<const N: usize>(Box<[bindings::of_device_id; N+1]>);

and everything appeared to work great, until building rustdoc. That failed with the internal compiler error described here.

/// # Invariants
///
/// The last array element is always filled with zeros (the default).
pub struct OfMatchTable(Box<[bindings::of_device_id; 2]>);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For public APIs, we're trying to avoid dependencies on bindings. In fact, eventually we want to make bindings private.

So I suggest a newtype here that wraps bindings::of_device_id. Or even some Rust struct that is easy to use by clients, accompanied by a (private) const function that converts that into an array of bindings::of_device_id -- which is what we do with file operations, for example.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @wedsonaf thank you so much for the review, really appreciate it !

I'm not sure I understand your feedback - the bindings:: member inside the public OfMatchTable is private, so clients outside of.rs don't know, see or care that it contains a (private) bindings:: type?

My apologies if I have misunderstood. Could you provide an example of what your concern is? That usually helps me "see the light".

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, my bad. For some reason I read this as a type alias instead of a new type. I even suggested that you do exactly what you're already doing.

// - `of_match_tbl` is either:
// - a raw pointer created from a heap structure which is never deallocated,
// therefore it lives at least as long as the module, or
// - null.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need to support 'static pointers, which would be created by const functions. Unless, of course, there is a technical reason that prevents us from doing so.

Copy link
Collaborator Author

@TheSven73 TheSven73 May 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree 100% that the structure should be 'static, const, and created at build time - I even indicated that in the PR's first commit message.

If we can go that route, your remaining concerns disappear, correct?

  • match table would be a'static pointer
  • we would not have to use a Box (ugh that looked bad - I know !)
  • length checks would happen at build time

In this PR's v2, I can get rid of the ugly Box issue, and make sure that it gets cleaned up properly, without resulting in a memory leak.

The problem is - I have no idea (yet!) how to create the table statically at build time.

Could I convince you to defer this issue until a later PR?
Can't get the perfect get in the way of the good?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS I tried using const generics when creating the match table structure. That's probably one of the things we'll need to create the table at build time? I ran into internal compiler errors, which is kind of an indication we're in uncharted territory here...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we don't have the box issue, I think we could get this is and improve on it later.

///
/// The resulting raw pointer is suitable to be assigned to a
/// `bindings::device_driver::of_match_table`.
pub(crate) fn into_table_ptr(self) -> *const bindings::of_device_id {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you intentionally consuming the Box here?

I think I understand why you'd want to do this, but I'm not a big fan of APIs that leak by design. Ideally we should have a way to recover and free this box if we unregisters the platform device.

self.0.into_pointer().cast()
}

fn new_of_device_id(compatible: &CStr<'static>) -> Result<bindings::of_device_id> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If feels like the length check could be done at compile time so that we reject bad strings as early as possible. We may need a macro + const fn + static_assert combination but I think it's doable. (And likely preferable if we can make the ergonomics good enough.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.len() is const so replacing line 48 with a static_assert should work out of the box

@ksquirrel

This comment has been minimized.

@TheSven73 TheSven73 force-pushed the rust-for-linux-pdev-pr2 branch from 453d7ea to a0d9af8 Compare May 19, 2021 20:57
@ksquirrel

This comment has been minimized.

Sven Van Asbroeck added 3 commits May 19, 2021 17:08
Implement the bare minimum required so a Rust `platdev` can match
with devices described in the devicetree. Driver and device will
match if their `compatible` strings match. The Linux kernel will
instantiate one device per match.

Ideally, the of table should be const, and created at build time,
but we haven't figured out how to accomplish that yet.

Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com>
Devices described in the devicetree with the following `compatible`
string will now match:

`brcm,bcm2835-rng`

Example devicetree .dts:

```dts
&some_bus {
        rng@7e104000 {
                compatible = "brcm,bcm2835-rng";
                reg = <0x7e104000 0x10>;
        };

        rng@7e104040 {
                compatible = "brcm,bcm2835-rng";
                reg = <0x7e104040 0x10>;
        };
};
```

This will instantiate two hardware random-number generator devices.

Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com>
Now that Rust `platdev` drivers can match with devicetree devices
using a `compatible` string, we no longer require the forked
devicetrees.

The standard Raspberry Pi devicetrees will work fine.

Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com>
@TheSven73 TheSven73 force-pushed the rust-for-linux-pdev-pr2 branch from a0d9af8 to c02591f Compare May 19, 2021 21:09
@ksquirrel
Copy link
Member

Review of c02591f2b011:

  • ✔️ Commit fb3fdcb: Looks fine!
  • ✔️ Commit 6e4043a: Looks fine!
  • ✔️ Commit c02591f: Looks fine!

@TheSven73
Copy link
Collaborator Author

TheSven73 commented May 19, 2021

v1 -> v2

wedsonaf:

  • clean up of_table after platform_driver_unregister()

TheSven73:

  • implement PointerWrapper for OfMatchTable which is idiomatic for obtaining kernel pointers to store in kernel structures

I opted to clean up the table by keeping a copy of the PointerWrapper raw pointer, and clean it up on Drop. I think that looks quite ugly though:

impl Drop for Registration {
    fn drop(&mut self) {
        if self.registered {
            // SAFETY: if `registered` is true, then `self.pdev` was registered
            // previously, which means `platform_driver_unregister` is always
            // safe to call.
            unsafe { bindings::platform_driver_unregister(&mut self.pdrv) }
        }
        if let Some(ptr) = self.of_table {
            // SAFETY: `ptr` came from an `OfMatchTable`.
            let tbl = unsafe { OfMatchTable::from_pointer(ptr) };
            drop(tbl);
        }
    }
}

My immediate instinct would be to find some general solution. Maybe we can convert the PointerWrapper into a PointerHolder which allows the raw pointer to be retrieved, while keeping ownership of the PointerWrapper ? Something along the following lines:

/// Used to retain ownership of a [`PointerWrapper`] after the `into_pointer()` call.
pub struct PointerHolder<T: PointerWrapper> {
    ptr: *const c_types::c_void,
    _phantom: PhantomData<T>,
}

impl<T: PointerWrapper> Drop for PointerHolder<T> {
    fn drop(&mut self) {
        let obj = unsafe { T::from_pointer(self.ptr) };
        drop(obj);
    }
}

impl<T: PointerWrapper> PointerHolder<T> {
    pub fn as_ref(&self) -> *const c_types::c_void {
        self.ptr
    }
}

impl<T: PointerWrapper> From<T> for PointerHolder<T> {
    fn from(wrapper: T) -> Self {
        PointerHolder {
            ptr: wrapper.into_pointer(),
            _phantom: PhantomData,
        }
    }
}

/// A registration of a platform device.
#[derive(Default)]
pub struct Registration {
    registered: bool,
-   of_table: Option<*const c_types::c_void>,
+   // OfMatchTable automatically cleaned up on Drop
+   of_table: Option<PointerHolder<OfMatchTable>>,
    pdrv: bindings::platform_driver,
    _pin: PhantomPinned,
}

However, we are planning to make this table static const, at which point all of this goes away. So probably not worth generalizing at this point. Maybe we can revisit this when we encounter our first dynamically allocated kernel object that contains another dynamically allocated kernel object?

Apologies for the extended brain dump :)

Edit: clarified PointerHolder riff.
Edit2: mention that v2 implements PointerWrapper for OfMatchTable. Sorry for the noise.

@nbdd0121
Copy link
Member

@TheSven73 have you checked the CBoundedStr<N> API proposed in #258? It has a constant upper bound for length N, can be created safely in compile time, and provide a method to expand into a fixed length char array.

@@ -18,6 +21,7 @@ use core::{marker::PhantomPinned, pin::Pin};
#[derive(Default)]
pub struct Registration {
registered: bool,
of_table: Option<*const c_types::c_void>,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be an Option<NonNull<c_types::c_void>> because as far as I can see all pointers it could contain ultimately come from a Box allocation. (This would safe at least one byte for the Option discriminant)

Copy link
Collaborator Author

@TheSven73 TheSven73 May 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, creating a NonNull from a ptr requires one additional unwrap or Err() return, which would introduce one more // PANIC: annotation?

pub fn new(ptr: *mut T) -> Option<NonNull<T>>

Also its docs say but is ultimately more dangerous to use because of its additional properties ?
So not sure if the pros outweigh the cons here.
But thank you for introducing me to NonNull ! It's basically a covariant pointer type, I didn't even know that this existed. That's great to know.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, creating a NonNull from a ptr requires one additional unwrap or Err() return, which would introduce one more // PANIC: annotation?

correct but since we create that pointer by using a Box::into_raw we can use NonNull::from(Box::leak(...)) instead.

This would of course require using a NonNull everywhere where that pointer was used but I think that is manageable/mostly handled by the compiler in this case.

Also its docs say but is ultimately more dangerous to use because of its additional properties?
So not sure if the pros outweigh the cons here.

I think that message is mostly for people new to Rust being told to use NonNull and attempting to use it like a normal C/raw pointer.

From my experience working with NonNulls it is generally easier to reason that code is safe when dereferencing (since you don't have to argue that a pointer can't be NULL everywhere)
but harder when creating a pointer. e.g. pointer arithmetic becomes very verbose since you now need to argue that your arithmetic can't make the pointer NULL (which I think is a good thing anyways).

TL;DR using NonNulls made from references is easier, doing pointer magic with them is harder / more error prone.

(I do of course understand if you still think this isn't worth the risk)

But thank you for introducing me to NonNull ! It's basically a covariant pointer type, I didn't even know that this existed. That's great to know.

You're welcome :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that in this case (a one-off use inside Registration), the light is probably not worth the candle. But, NonNull<T> might be extremely useful as part of the implementation detail of more general, core abstractions. The more such an abstraction is re-used, the more we can accept verbosity and "tricky" guarantees inside of it.

I really appreciate your knowledge and expertise !

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had assumed that the Registration would live for the whole lifetime of the module so saving up to 8 bytes for basically free would be worth it but I agree that in this case it probably doesn't matter that much.

I really appreciate your knowledge and expertise!

I'm glad I can at least contribute something on the Rust usage side since I have very little prior experience with linux kernel code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. Boxes aren't the only structures that can convert into() shared kernel pointers! PointerWrapper is the trait that's supposed to be implemented by objects that can go from/into shared pointers. Currently implemented for Box, Arc, Pin and Ref.

Copy link

@soruh soruh May 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I had not yet encountered PointerWrapper. I agree that it doesn't really make sense to provide a helper parallel to that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NonNull<T> may make sense in some of the code surrounding `PointerWrapper - when there's a case to refine that abstraction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NonNull is generally preferred in data structures wherever a pointer is sure to be not null. Option<NonNull<T>> can be used if a pointer is nullable. Note that Option<NonNull<T>> is even FFI-safe -- it is treated exactly as *mut T for ABI purporses.

Doing so forces the user to consider about the nullability of a pointer, so it can prevent null dereference even in unsafe code. The only exception are probably local variables, as raw pointers have more helper function to do arithmetic on.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think I'll take this as an opportunity to familiarize myself with the code base and look for opportunities where NonNull would be applicable and an actual improvement.

Comment on lines +35 to +37
pub fn new(compatible: &CStr<'static>) -> Result<Self> {
let tbl: InnerTable = Box::try_new([
Self::new_of_device_id(compatible)?,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this allow for multiple match strings? Isn't the point of the match table to allow for multiple of_device_id? We've used it in a driver, and apart from the array needing to be NULL-terminated, it can have any number of elements, not necessarily just one with a comma-separated string. But we might be wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely! That's planned for a follow-up PR. In fact, we want the table to be generated static const but we don't know how yet. I tried generating a variable-size table (for multiple match strings) using const generics, but I ran into an internal rustc error...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok nice

self.0.into_pointer().cast()
}

fn new_of_device_id(compatible: &CStr<'static>) -> Result<bindings::of_device_id> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.len() is const so replacing line 48 with a static_assert should work out of the box

if let Some(ptr) = self.of_table {
// SAFETY: `ptr` came from an `OfMatchTable`.
let tbl = unsafe { OfMatchTable::from_pointer(ptr) };
drop(tbl);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need to drop it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The drop(tbl) is completely redundant, tbl would get dropped even without it. It's just there to indicate to the reader that we are dropping the table there.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Imao, it's not pretty 😅

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe I agree... I think I used the word ugly myself :)

Comment on lines +57 to +61
if let Some(tbl) = of_match_table {
let ptr = tbl.into_pointer();
this.of_table = Some(ptr);
this.pdrv.driver.of_match_table = ptr.cast();
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this.of_table should be a Option<OfMatchTable>. I don't understand the need of a raw pointer here. This could fix the drop later on ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because structures that are shared with kernel structs using raw pointers, use the PointerWrapper abstraction/trait. This trait ensures that the struct is consumed when turned into a kernel pointer. After the call to into_pointer(), there is no tbl object left to store anywhere !

@nbdd0121
Copy link
Member

I suppose that you could use a similar approach to the one used in #273 for CStr.

#[repr(transparent)]
struct OfDeviceId(bindings::of_device_id);

impl OfDeviceId {
    pub const fn nul() -> Self {
        // give a nul
    }

    pub const fn new(/* blah */) -> Self {
        // construct
    }
}

#[repr(transparent)]
struct OfMatchTable([OfDeviceId]);

impl OfMatchTable {
    pub const fn from_slice_with_nul(arr: &[OfDeviceId]) -> &Self {
        // Verify that none of non-last entries are NUL
        // Verify that last entry of arr is indeed NUL
        unsafe { Self::from_slice_with_nul_unchecked(arr) }
    }
    
    pub const unsafe fn from_slice_with_nul_unchecked(arr: &[OfDeviceId]) -> &Self {
        &*(arr as *const [OfDeviceId] as *const Self)
    }
}

then people could use

const TBL: &'static OfMatchTable = OfMatchTable::from_slice_with_nul(&[OfDeviceId::new(/*blah*/), OfDeviceId::nul()]);

@TheSven73
Copy link
Collaborator Author

@nbdd0121 that's a cool idea actually !

@TheSven73
Copy link
Collaborator Author

I'd prefer to get this merged without const fn, static or variable-size match tables for now, if possible. Can't get the perfect get in the way of the good. We can improve in a follow-up PR?

@TheSven73
Copy link
Collaborator Author

@wedsonaf in case you are unsatisfied with v2, feel free to ask for a "reboot". I can handle constructive criticism, and have plenty of alternative approaches up my sleeve :)

@wedsonaf
Copy link

@wedsonaf in case you are unsatisfied with v2, feel free to ask for a "reboot". I can handle constructive criticism, and have plenty of alternative approaches up my sleeve :)

Looking at it now. Give me a few minutes. :)

@TheSven73
Copy link
Collaborator Author

No time pressure here at all. Please do take as much time as you need!

Copy link

@wedsonaf wedsonaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Let's merge this so that we can make progress on the driver.

We can always iterate on the wrinkles that don't affect safety.

impl OfMatchTable {
/// Creates a [`OfMatchTable`] from a single `compatible` string.
pub fn new(compatible: &CStr<'static>) -> Result<Self> {
let tbl: InnerTable = Box::try_new([

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we need the type (InnerTable) explicitly given here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right - there's no need. I don't like unnecessary explicit type annotations either. I'll do a follow-up PR to eliminate it.

@wedsonaf wedsonaf merged commit 698f04d into Rust-for-Linux:rust May 25, 2021
@TheSven73 TheSven73 deleted the rust-for-linux-pdev-pr2 branch May 25, 2021 16:45
ojeda pushed a commit that referenced this pull request Jan 20, 2024
…test_init()

The synth_event_gen_test module can be built in, if someone wants to run
the tests at boot up and not have to load them.

The synth_event_gen_test_init() function creates and enables the synthetic
events and runs its tests.

The synth_event_gen_test_exit() disables the events it created and
destroys the events.

If the module is builtin, the events are never disabled. The issue is, the
events should be disable after the tests are run. This could be an issue
if the rest of the boot up tests are enabled, as they expect the events to
be in a known state before testing. That known state happens to be
disabled.

When CONFIG_SYNTH_EVENT_GEN_TEST=y and CONFIG_EVENT_TRACE_STARTUP_TEST=y
a warning will trigger:

 Running tests on trace events:
 Testing event create_synth_test:
 Enabled event during self test!
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 1 at kernel/trace/trace_events.c:4150 event_trace_self_tests+0x1c2/0x480
 Modules linked in:
 CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.7.0-rc2-test-00031-gb803d7c664d5-dirty #276
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
 RIP: 0010:event_trace_self_tests+0x1c2/0x480
 Code: bb e8 a2 ab 5d fc 48 8d 7b 48 e8 f9 3d 99 fc 48 8b 73 48 40 f6 c6 01 0f 84 d6 fe ff ff 48 c7 c7 20 b6 ad bb e8 7f ab 5d fc 90 <0f> 0b 90 48 89 df e8 d3 3d 99 fc 48 8b 1b 4c 39 f3 0f 85 2c ff ff
 RSP: 0000:ffffc9000001fdc0 EFLAGS: 00010246
 RAX: 0000000000000029 RBX: ffff88810399ca80 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffffffffb9f19478 RDI: ffff88823c734e64
 RBP: ffff88810399f300 R08: 0000000000000000 R09: fffffbfff79eb32a
 R10: ffffffffbcf59957 R11: 0000000000000001 R12: ffff888104068090
 R13: ffffffffbc89f0a0 R14: ffffffffbc8a0f08 R15: 0000000000000078
 FS:  0000000000000000(0000) GS:ffff88823c700000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000001f6282001 CR4: 0000000000170ef0
 Call Trace:
  <TASK>
  ? __warn+0xa5/0x200
  ? event_trace_self_tests+0x1c2/0x480
  ? report_bug+0x1f6/0x220
  ? handle_bug+0x6f/0x90
  ? exc_invalid_op+0x17/0x50
  ? asm_exc_invalid_op+0x1a/0x20
  ? tracer_preempt_on+0x78/0x1c0
  ? event_trace_self_tests+0x1c2/0x480
  ? __pfx_event_trace_self_tests_init+0x10/0x10
  event_trace_self_tests_init+0x27/0xe0
  do_one_initcall+0xd6/0x3c0
  ? __pfx_do_one_initcall+0x10/0x10
  ? kasan_set_track+0x25/0x30
  ? rcu_is_watching+0x38/0x60
  kernel_init_freeable+0x324/0x450
  ? __pfx_kernel_init+0x10/0x10
  kernel_init+0x1f/0x1e0
  ? _raw_spin_unlock_irq+0x33/0x50
  ret_from_fork+0x34/0x60
  ? __pfx_kernel_init+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  </TASK>

This is because the synth_event_gen_test_init() left the synthetic events
that it created enabled. By having it disable them after testing, the
other selftests will run fine.

Link: https://lore.kernel.org/linux-trace-kernel/20231220111525.2f0f49b0@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 9fe41ef ("tracing: Add synth event generation test module")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: Alexander Graf <graf@amazon.com>
Tested-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
ojeda pushed a commit that referenced this pull request Jan 22, 2024
======================================================
WARNING: possible circular locking dependency detected
6.5.0-kfd-fkuehlin #276 Not tainted
------------------------------------------------------
kworker/8:2/2676 is trying to acquire lock:
ffff9435aae95c88 ((work_completion)(&svm_bo->eviction_work)){+.+.}-{0:0}, at: __flush_work+0x52/0x550

but task is already holding lock:
ffff9435cd8e1720 (&svms->lock){+.+.}-{3:3}, at: svm_range_deferred_list_work+0xe8/0x340 [amdgpu]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&svms->lock){+.+.}-{3:3}:
       __mutex_lock+0x97/0xd30
       kfd_ioctl_alloc_memory_of_gpu+0x6d/0x3c0 [amdgpu]
       kfd_ioctl+0x1b2/0x5d0 [amdgpu]
       __x64_sys_ioctl+0x86/0xc0
       do_syscall_64+0x39/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd

-> #1 (&mm->mmap_lock){++++}-{3:3}:
       down_read+0x42/0x160
       svm_range_evict_svm_bo_worker+0x8b/0x340 [amdgpu]
       process_one_work+0x27a/0x540
       worker_thread+0x53/0x3e0
       kthread+0xeb/0x120
       ret_from_fork+0x31/0x50
       ret_from_fork_asm+0x11/0x20

-> #0 ((work_completion)(&svm_bo->eviction_work)){+.+.}-{0:0}:
       __lock_acquire+0x1426/0x2200
       lock_acquire+0xc1/0x2b0
       __flush_work+0x80/0x550
       __cancel_work_timer+0x109/0x190
       svm_range_bo_release+0xdc/0x1c0 [amdgpu]
       svm_range_free+0x175/0x180 [amdgpu]
       svm_range_deferred_list_work+0x15d/0x340 [amdgpu]
       process_one_work+0x27a/0x540
       worker_thread+0x53/0x3e0
       kthread+0xeb/0x120
       ret_from_fork+0x31/0x50
       ret_from_fork_asm+0x11/0x20

other info that might help us debug this:

Chain exists of:
  (work_completion)(&svm_bo->eviction_work) --> &mm->mmap_lock --> &svms->lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&svms->lock);
                               lock(&mm->mmap_lock);
                               lock(&svms->lock);
  lock((work_completion)(&svm_bo->eviction_work));

I believe this cannot really lead to a deadlock in practice, because
svm_range_evict_svm_bo_worker only takes the mmap_read_lock if the BO
refcount is non-0. That means it's impossible that svm_range_bo_release
is running concurrently. However, there is no good way to annotate this.

To avoid the problem, take a BO reference in
svm_range_schedule_evict_svm_bo instead of in the worker. That way it's
impossible for a BO to get freed while eviction work is pending and the
cancel_work_sync call in svm_range_bo_release can be eliminated.

v2: Use svm_bo_ref_unless_zero and explained why that's safe. Also
removed redundant checks that are already done in
amdkfd_fence_enable_signaling.

Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants