Avoid using a bunch of cfg flags and use of hardcoded auto.conf path #22

Closed
kloenk opened this issue Oct 10, 2020 · 3 comments

Comments

@kloenk
Member

kloenk commented Oct 10, 2020

Currently, building with O=$something fails because auto.conf is searched for in the source tree.
Also, the cfg arguments parsed from auto.conf and passed on the command line can exceed Linux's maximum argument length.

@kloenk
Member Author

kloenk commented Oct 10, 2020

For configs that are y, do we want to pass --cfg=name="y", or do we just want --cfg=name?

@kloenk kloenk added this to the First PR milestone Oct 10, 2020
@ojeda
Member

ojeda commented Oct 10, 2020

For configs, that are y, do we want to pass --cfg=name="y" or do we just want --cfg=name

Good question. On the C side we have the IS_BUILTIN/MODULE/REACHABLE/ENABLED macros as well as just #ifdef etc.

I guess for ease of use we could provide each CONFIG_X value for both the value and the "empty" value, i.e.:

  • For y: CONFIG_X and CONFIG_X="y".
  • For m: CONFIG_X and CONFIG_X="m".

We can forget about the IS_REACHABLE for the moment (one can always put the equivalent predicate if needed).

For other kinds of values (e.g. integers and strings), I don't think we want to set the CONFIG_X, but it does not hurt and would be consistent.
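
To make the proposal above concrete, here is a minimal sketch of how each auto.conf entry could be turned into both forms. It is only an illustration: `cfg_flags` is a hypothetical helper, not what the build system actually does, and it assumes every non-comment line has the form NAME=VALUE with simple values (no escaped quotes).

```rust
/// Turn the contents of an `auto.conf`-style file into `--cfg` flags.
/// Hypothetical helper for illustration; not the kernel's build logic.
fn cfg_flags(auto_conf: &str) -> Vec<String> {
    let mut flags = Vec::new();
    for line in auto_conf.lines() {
        let line = line.trim();
        // Skip blank lines and comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // Assumption: every remaining line looks like NAME=VALUE.
        let Some((name, value)) = line.split_once('=') else {
            continue;
        };
        // The "empty" form, usable as #[cfg(CONFIG_X)].
        flags.push(format!("--cfg={name}"));
        // The valued form, usable as #[cfg(CONFIG_X = "y")].
        // String values are already quoted in the file, so strip the
        // outer quotes before adding our own.
        let value = value.trim_matches('"');
        flags.push(format!("--cfg={name}=\"{value}\""));
    }
    flags
}
```

With this mapping, an entry such as CONFIG_X=m yields both `--cfg=CONFIG_X` and `--cfg=CONFIG_X="m"`, so Rust code could test either "enabled at all" or "built as a module". Integer and string entries get the same two forms, which matches the "does not hurt and would be consistent" option.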

ojeda pushed a commit that referenced this issue Nov 28, 2020
It also helps to avoid spamming the terminal when doing verbose builds.

Signed-off-by: Finn Behrens <me@kloenk.de>
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
@ojeda
Member

ojeda commented Nov 28, 2020

For configs, that are y, do we want to pass --cfg=name="y" or do we just want --cfg=name

Good question. On the C side we have the IS_BUILTIN/MODULE/REACHABLE/ENABLED macros as well as just #ifdef etc.

I guess for ease of use we could provide each CONFIG_X value for both the value and the "empty" value, i.e.:

  • For y: CONFIG_X and CONFIG_X="y".
  • For m: CONFIG_X and CONFIG_X="m".

We can forget about the IS_REACHABLE for the moment (one can always put the equivalent predicate if needed).

For other kinds of values (e.g. integers and strings), I don't think we want to set the CONFIG_X, but it does not hurt and would be consistent.

This is handled by #32

@ojeda ojeda closed this as completed in 39484fa Nov 28, 2020
ojeda added a commit that referenced this issue Nov 28, 2020
Use @file to pass the --cfg flags to rustc (fixes #22)
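
For readers unfamiliar with the mechanism: rustc accepts an `@path` argument and reads further command-line arguments from that file, one per line, which keeps the command line short no matter how many configs exist. The sketch below shows the idea only; `write_args_file` is a hypothetical helper and the real fix lives in the kernel Makefiles, not in Rust code.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write one argument per line to `path` and return the `@path`
/// argument to hand to rustc instead of the individual flags.
/// Hypothetical helper for illustration only.
fn write_args_file(path: &Path, flags: &[String]) -> io::Result<String> {
    // One argument per line; no shell quoting is applied here.
    let mut contents = flags.join("\n");
    contents.push('\n');
    fs::write(path, contents)?;
    Ok(format!("@{}", path.display()))
}
```

The returned string would then be passed as a single argument to rustc, so the argument length no longer grows with the number of configs, and verbose builds print one short argument instead of every flag.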
ojeda pushed a commit that referenced this issue Jul 29, 2021
The problem occurs between dev_get_by_index() and dev_xdp_attach_link().
At this point, dev_xdp_uninstall() is called. Then xdp link will not be
detached automatically when dev is released. But link->dev already
points to dev, when xdp link is released, dev will still be accessed,
but dev has been released.

dev_get_by_index()        |
link->dev = dev           |
                          |      rtnl_lock()
                          |      unregister_netdevice_many()
                          |          dev_xdp_uninstall()
                          |      rtnl_unlock()
rtnl_lock();              |
dev_xdp_attach_link()     |
rtnl_unlock();            |
                          |      netdev_run_todo() // dev released
bpf_xdp_link_release()    |
    /* access dev.        |
       use-after-free */  |

[   45.966867] BUG: KASAN: use-after-free in bpf_xdp_link_release+0x3b8/0x3d0
[   45.967619] Read of size 8 at addr ffff00000f9980c8 by task a.out/732
[   45.968297]
[   45.968502] CPU: 1 PID: 732 Comm: a.out Not tainted 5.13.0+ #22
[   45.969222] Hardware name: linux,dummy-virt (DT)
[   45.969795] Call trace:
[   45.970106]  dump_backtrace+0x0/0x4c8
[   45.970564]  show_stack+0x30/0x40
[   45.970981]  dump_stack_lvl+0x120/0x18c
[   45.971470]  print_address_description.constprop.0+0x74/0x30c
[   45.972182]  kasan_report+0x1e8/0x200
[   45.972659]  __asan_report_load8_noabort+0x2c/0x50
[   45.973273]  bpf_xdp_link_release+0x3b8/0x3d0
[   45.973834]  bpf_link_free+0xd0/0x188
[   45.974315]  bpf_link_put+0x1d0/0x218
[   45.974790]  bpf_link_release+0x3c/0x58
[   45.975291]  __fput+0x20c/0x7e8
[   45.975706]  ____fput+0x24/0x30
[   45.976117]  task_work_run+0x104/0x258
[   45.976609]  do_notify_resume+0x894/0xaf8
[   45.977121]  work_pending+0xc/0x328
[   45.977575]
[   45.977775] The buggy address belongs to the page:
[   45.978369] page:fffffc00003e6600 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4f998
[   45.979522] flags: 0x7fffe0000000000(node=0|zone=0|lastcpupid=0x3ffff)
[   45.980349] raw: 07fffe0000000000 fffffc00003e6708 ffff0000dac3c010 0000000000000000
[   45.981309] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[   45.982259] page dumped because: kasan: bad access detected
[   45.982948]
[   45.983153] Memory state around the buggy address:
[   45.983753]  ffff00000f997f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   45.984645]  ffff00000f998000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   45.985533] >ffff00000f998080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   45.986419]                                               ^
[   45.987112]  ffff00000f998100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   45.988006]  ffff00000f998180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   45.988895] ==================================================================
[   45.989773] Disabling lock debugging due to kernel taint
[   45.990552] Kernel panic - not syncing: panic_on_warn set ...
[   45.991166] CPU: 1 PID: 732 Comm: a.out Tainted: G    B             5.13.0+ #22
[   45.991929] Hardware name: linux,dummy-virt (DT)
[   45.992448] Call trace:
[   45.992753]  dump_backtrace+0x0/0x4c8
[   45.993208]  show_stack+0x30/0x40
[   45.993627]  dump_stack_lvl+0x120/0x18c
[   45.994113]  dump_stack+0x1c/0x34
[   45.994530]  panic+0x3a4/0x7d8
[   45.994930]  end_report+0x194/0x198
[   45.995380]  kasan_report+0x134/0x200
[   45.995850]  __asan_report_load8_noabort+0x2c/0x50
[   45.996453]  bpf_xdp_link_release+0x3b8/0x3d0
[   45.997007]  bpf_link_free+0xd0/0x188
[   45.997474]  bpf_link_put+0x1d0/0x218
[   45.997942]  bpf_link_release+0x3c/0x58
[   45.998429]  __fput+0x20c/0x7e8
[   45.998833]  ____fput+0x24/0x30
[   45.999247]  task_work_run+0x104/0x258
[   45.999731]  do_notify_resume+0x894/0xaf8
[   46.000236]  work_pending+0xc/0x328
[   46.000697] SMP: stopping secondary CPUs
[   46.001226] Dumping ftrace buffer:
[   46.001663]    (ftrace buffer empty)
[   46.002110] Kernel Offset: disabled
[   46.002545] CPU features: 0x00000001,23202c00
[   46.003080] Memory Limit: none

Fixes: aa8d3a7 ("bpf, xdp: Add bpf_link-based XDP attachment API")
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210710031635.41649-1-xuanzhuo@linux.alibaba.com
ojeda pushed a commit that referenced this issue Dec 20, 2021
In commit 142639a ("drm/msm/a6xx: fix crashstate capture for
A650") we changed a6xx_get_gmu_registers() to read 3 sets of
registers. Unfortunately, we didn't change the memory allocation for
the array. That leads to a KASAN warning (this was on the chromeos-5.4
kernel, which has the problematic commit backported to it):

  BUG: KASAN: slab-out-of-bounds in _a6xx_get_gmu_registers+0x144/0x430
  Write of size 8 at addr ffffff80c89432b0 by task A618-worker/209
  CPU: 5 PID: 209 Comm: A618-worker Tainted: G        W         5.4.156-lockdep #22
  Hardware name: Google Lazor Limozeen without Touchscreen (rev5 - rev8) (DT)
  Call trace:
   dump_backtrace+0x0/0x248
   show_stack+0x20/0x2c
   dump_stack+0x128/0x1ec
   print_address_description+0x88/0x4a0
   __kasan_report+0xfc/0x120
   kasan_report+0x10/0x18
   __asan_report_store8_noabort+0x1c/0x24
   _a6xx_get_gmu_registers+0x144/0x430
   a6xx_gpu_state_get+0x330/0x25d4
   msm_gpu_crashstate_capture+0xa0/0x84c
   recover_worker+0x328/0x838
   kthread_worker_fn+0x32c/0x574
   kthread+0x2dc/0x39c
   ret_from_fork+0x10/0x18

  Allocated by task 209:
   __kasan_kmalloc+0xfc/0x1c4
   kasan_kmalloc+0xc/0x14
   kmem_cache_alloc_trace+0x1f0/0x2a0
   a6xx_gpu_state_get+0x164/0x25d4
   msm_gpu_crashstate_capture+0xa0/0x84c
   recover_worker+0x328/0x838
   kthread_worker_fn+0x32c/0x574
   kthread+0x2dc/0x39c
   ret_from_fork+0x10/0x18

Fixes: 142639a ("drm/msm/a6xx: fix crashstate capture for A650")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20211103153049.1.Idfa574ccb529d17b69db3a1852e49b580132035c@changeid
Signed-off-by: Rob Clark <robdclark@chromium.org>
ojeda pushed a commit that referenced this issue Jan 10, 2022
…port_id()

The array param[] in elantech_change_report_id() must be at least 3
bytes, because elantech_read_reg_params() is calling ps2_command() with
PSMOUSE_CMD_GETINFO, that is going to access 3 bytes from param[], but
it's defined in the stack as an array of 2 bytes, therefore we have a
potential stack out-of-bounds access here, also confirmed by KASAN:

[    6.512374] BUG: KASAN: stack-out-of-bounds in __ps2_command+0x372/0x7e0
[    6.512397] Read of size 1 at addr ffff8881024d77c2 by task kworker/2:1/118

[    6.512416] CPU: 2 PID: 118 Comm: kworker/2:1 Not tainted 5.13.0-22-generic #22+arighi20211110
[    6.512428] Hardware name: LENOVO 20T8000QGE/20T8000QGE, BIOS R1AET32W (1.08 ) 08/14/2020
[    6.512436] Workqueue: events_long serio_handle_event
[    6.512453] Call Trace:
[    6.512462]  show_stack+0x52/0x58
[    6.512474]  dump_stack+0xa1/0xd3
[    6.512487]  print_address_description.constprop.0+0x1d/0x140
[    6.512502]  ? __ps2_command+0x372/0x7e0
[    6.512516]  __kasan_report.cold+0x7d/0x112
[    6.512527]  ? _raw_write_lock_irq+0x20/0xd0
[    6.512539]  ? __ps2_command+0x372/0x7e0
[    6.512552]  kasan_report+0x3c/0x50
[    6.512564]  __asan_load1+0x6a/0x70
[    6.512575]  __ps2_command+0x372/0x7e0
[    6.512589]  ? ps2_drain+0x240/0x240
[    6.512601]  ? dev_printk_emit+0xa2/0xd3
[    6.512612]  ? dev_vprintk_emit+0xc5/0xc5
[    6.512621]  ? __kasan_check_write+0x14/0x20
[    6.512634]  ? mutex_lock+0x8f/0xe0
[    6.512643]  ? __mutex_lock_slowpath+0x20/0x20
[    6.512655]  ps2_command+0x52/0x90
[    6.512670]  elantech_ps2_command+0x4f/0xc0 [psmouse]
[    6.512734]  elantech_change_report_id+0x1e6/0x256 [psmouse]
[    6.512799]  ? elantech_report_trackpoint.constprop.0.cold+0xd/0xd [psmouse]
[    6.512863]  ? ps2_command+0x7f/0x90
[    6.512877]  elantech_query_info.cold+0x6bd/0x9ed [psmouse]
[    6.512943]  ? elantech_setup_ps2+0x460/0x460 [psmouse]
[    6.513005]  ? psmouse_reset+0x69/0xb0 [psmouse]
[    6.513064]  ? psmouse_attr_set_helper+0x2a0/0x2a0 [psmouse]
[    6.513122]  ? phys_pmd_init+0x30e/0x521
[    6.513137]  elantech_init+0x8a/0x200 [psmouse]
[    6.513200]  ? elantech_init_ps2+0xf0/0xf0 [psmouse]
[    6.513249]  ? elantech_query_info+0x440/0x440 [psmouse]
[    6.513296]  ? synaptics_send_cmd+0x60/0x60 [psmouse]
[    6.513342]  ? elantech_query_info+0x440/0x440 [psmouse]
[    6.513388]  ? psmouse_try_protocol+0x11e/0x170 [psmouse]
[    6.513432]  psmouse_extensions+0x65d/0x6e0 [psmouse]
[    6.513476]  ? psmouse_try_protocol+0x170/0x170 [psmouse]
[    6.513519]  ? mutex_unlock+0x22/0x40
[    6.513526]  ? ps2_command+0x7f/0x90
[    6.513536]  ? psmouse_probe+0xa3/0xf0 [psmouse]
[    6.513580]  psmouse_switch_protocol+0x27d/0x2e0 [psmouse]
[    6.513624]  psmouse_connect+0x272/0x530 [psmouse]
[    6.513669]  serio_driver_probe+0x55/0x70
[    6.513679]  really_probe+0x190/0x720
[    6.513689]  driver_probe_device+0x160/0x1f0
[    6.513697]  device_driver_attach+0x119/0x130
[    6.513705]  ? device_driver_attach+0x130/0x130
[    6.513713]  __driver_attach+0xe7/0x1a0
[    6.513720]  ? device_driver_attach+0x130/0x130
[    6.513728]  bus_for_each_dev+0xfb/0x150
[    6.513738]  ? subsys_dev_iter_exit+0x10/0x10
[    6.513748]  ? _raw_write_unlock_bh+0x30/0x30
[    6.513757]  driver_attach+0x2d/0x40
[    6.513764]  serio_handle_event+0x199/0x3d0
[    6.513775]  process_one_work+0x471/0x740
[    6.513785]  worker_thread+0x2d2/0x790
[    6.513794]  ? process_one_work+0x740/0x740
[    6.513802]  kthread+0x1b4/0x1e0
[    6.513809]  ? set_kthread_struct+0x80/0x80
[    6.513816]  ret_from_fork+0x22/0x30

[    6.513832] The buggy address belongs to the page:
[    6.513838] page:00000000bc35e189 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1024d7
[    6.513847] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
[    6.513860] raw: 0017ffffc0000000 dead000000000100 dead000000000122 0000000000000000
[    6.513867] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[    6.513872] page dumped because: kasan: bad access detected

[    6.513879] addr ffff8881024d77c2 is located in stack of task kworker/2:1/118 at offset 34 in frame:
[    6.513887]  elantech_change_report_id+0x0/0x256 [psmouse]

[    6.513941] this frame has 1 object:
[    6.513947]  [32, 34) 'param'

[    6.513956] Memory state around the buggy address:
[    6.513962]  ffff8881024d7680: f2 f2 f2 f2 f2 00 00 f3 f3 00 00 00 00 00 00 00
[    6.513969]  ffff8881024d7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    6.513976] >ffff8881024d7780: 00 00 00 00 f1 f1 f1 f1 02 f3 f3 f3 00 00 00 00
[    6.513982]                                            ^
[    6.513988]  ffff8881024d7800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    6.513995]  ffff8881024d7880: 00 f1 f1 f1 f1 03 f2 03 f2 03 f3 f3 f3 00 00 00
[    6.514000] ==================================================================

Define param[] in elantech_change_report_id() as an array of 3 bytes to
prevent the out-of-bounds access in the stack.
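
As an aside, this class of bug is one where a fixed-size array type helps. The sketch below is a hedged illustration in Rust, not the driver's code: `get_info` and `read_first_byte` are hypothetical stand-ins, and the byte values are made up. The point is only that a callee which needs 3 bytes can say so in its signature.

```rust
/// Stand-in for a "get info" style command that always writes three
/// response bytes. Hypothetical; this is not the kernel's ps2_command().
fn get_info(param: &mut [u8; 3]) {
    // Made-up response bytes, for illustration only.
    *param = [0x01, 0x02, 0x03];
}

/// Hypothetical caller that needs only the first response byte.
fn read_first_byte() -> u8 {
    // Declaring this as `[0u8; 2]` would be a compile-time type error,
    // because `get_info` requires exactly three bytes.
    let mut param = [0u8; 3];
    get_info(&mut param);
    param[0]
}
```

In the C driver the equivalent contract is implicit, which is why the 2-byte array compiled and the overrun only showed up under KASAN.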

Fixes: e4c9062 ("Input: elantech - fix protocol errors for some trackpoints in SMBus mode")
BugLink: https://bugs.launchpad.net/bugs/1945590
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Reviewed-by: Wolfram Sang <wsa@kernel.org>
Link: https://lore.kernel.org/r/20211116095559.24395-1-andrea.righi@canonical.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
ojeda pushed a commit that referenced this issue Feb 28, 2022
When bringing down the netdevice or system shutdown, a panic can be
triggered while accessing the sysfs path because the device is already
removed.

    [  755.549084] mlx5_core 0000:12:00.1: Shutdown was called
    [  756.404455] mlx5_core 0000:12:00.0: Shutdown was called
    ...
    [  757.937260] BUG: unable to handle kernel NULL pointer dereference at           (null)
    [  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280

    crash> bt
    ...
    PID: 12649  TASK: ffff8924108f2100  CPU: 1   COMMAND: "amsd"
    ...
     #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
        [exception RIP: dma_pool_alloc+0x1ab]
        RIP: ffffffff8ee11acb  RSP: ffff89240e1a3968  RFLAGS: 00010046
        RAX: 0000000000000246  RBX: ffff89243d874100  RCX: 0000000000001000
        RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff89243d874090
        RBP: ffff89240e1a39c0   R8: 000000000001f080   R9: ffff8905ffc03c00
        R10: ffffffffc04680d4  R11: ffffffff8edde9fd  R12: 00000000000080d0
        R13: ffff89243d874090  R14: ffff89243d874080  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
    #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
    #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
    #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
    #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
    #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
    #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
    #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
    #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
    #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
    #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
    #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
    #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
    #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
    #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
    #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
    #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92

    crash> net_device.state ffff89443b0c0000
      state = 0x5  (__LINK_STATE_START| __LINK_STATE_NOCARRIER)

To prevent this scenario, we also make sure that the netdevice is present.

Signed-off-by: suresh kumar <suresh2514@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ojeda pushed a commit that referenced this issue Apr 12, 2022
After waking up a suspended VM, the kernel prints the following trace
for virtio drivers which do not directly call virtio_device_ready() in
the .restore:

    PM: suspend exit
    irq 22: nobody cared (try booting with the "irqpoll" option)
    Call Trace:
     <IRQ>
     dump_stack_lvl+0x38/0x49
     dump_stack+0x10/0x12
     __report_bad_irq+0x3a/0xaf
     note_interrupt.cold+0xb/0x60
     handle_irq_event+0x71/0x80
     handle_fasteoi_irq+0x95/0x1e0
     __common_interrupt+0x6b/0x110
     common_interrupt+0x63/0xe0
     asm_common_interrupt+0x1e/0x40
     ? __do_softirq+0x75/0x2f3
     irq_exit_rcu+0x93/0xe0
     sysvec_apic_timer_interrupt+0xac/0xd0
     </IRQ>
     <TASK>
     asm_sysvec_apic_timer_interrupt+0x12/0x20
     arch_cpu_idle+0x12/0x20
     default_idle_call+0x39/0xf0
     do_idle+0x1b5/0x210
     cpu_startup_entry+0x20/0x30
     start_secondary+0xf3/0x100
     secondary_startup_64_no_verify+0xc3/0xcb
     </TASK>
    handlers:
    [<000000008f9bac49>] vp_interrupt
    [<000000008f9bac49>] vp_interrupt
    Disabling IRQ #22

This happens because we don't invoke .enable_cbs callback in
virtio_device_restore(). That callback is used by some transports
(e.g. virtio-pci) to enable interrupts.

Let's fix it, by calling virtio_device_ready() as we do in
virtio_dev_probe(). This function calls the .enable_cbs callback and sets
DRIVER_OK status bit.

This fix also avoids setting DRIVER_OK twice for those drivers that
call virtio_device_ready() in the .restore.

Fixes: d50497e ("virtio_config: introduce a new .enable_cbs method")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20220322114313.116516-1-sgarzare@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
ojeda pushed a commit that referenced this issue Dec 4, 2022
Tests for races between shinfo_cache (de)activation and hypercall+ioctl()
processing.  KVM has had bugs where activating the shared info cache
multiple times and/or with concurrent users results in lock corruption,
NULL pointer dereferences, and other fun.

For the timer injection testcase (#22), re-arm the timer until the IRQ
is successfully injected.  If the timer expires while the shared info
is deactivated (invalid), KVM will drop the event.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013211234.1318131-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
ojeda pushed a commit that referenced this issue Apr 6, 2023
When a system with E810 with existing VFs gets rebooted the following
hang may be observed.

 Pid 1 is hung in iavf_remove(), part of a network driver:
 PID: 1        TASK: ffff965400e5a340  CPU: 24   COMMAND: "systemd-shutdow"
  #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb
  #1 [ffffaad04005fae8] schedule at ffffffff8b323e2d
  #2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc
  #3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930
  #4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf]
  #5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513
  #6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa
  #7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc
  #8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e
  #9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429
 #10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4
 #11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice]
 #12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice]
 #13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice]
 #14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1
 #15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386
 #16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870
 #17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6
 #18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159
 #19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc
 #20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d
 #21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169
 #22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b
     RIP: 00007f1baa5c13d7  RSP: 00007fffbcc55a98  RFLAGS: 00000202
     RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f1baa5c13d7
     RDX: 0000000001234567  RSI: 0000000028121969  RDI: 00000000fee1dead
     RBP: 00007fffbcc55ca0   R8: 0000000000000000   R9: 00007fffbcc54e90
     R10: 00007fffbcc55050  R11: 0000000000000202  R12: 0000000000000005
     R13: 0000000000000000  R14: 00007fffbcc55af0  R15: 0000000000000000
     ORIG_RAX: 00000000000000a9  CS: 0033  SS: 002b

During reboot, all drivers' PM shutdown callbacks are invoked.
In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE.
In ice_shutdown() the call chain above is executed, which at some point
calls iavf_remove(). However iavf_remove() expects the VF to be in one
of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If
that's not the case it sleeps forever.
So if iavf_shutdown() gets invoked before iavf_remove() the system will
hang indefinitely because the adapter is already in state __IAVF_REMOVE.

Fix this by returning from iavf_remove() if the state is __IAVF_REMOVE,
as we already went through iavf_shutdown().
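
The state check described above can be modelled as a small decision function. This is a simplified sketch under stated assumptions, not the driver's code: the `State` and `RemoveOutcome` types and the `remove` function are hypothetical, and the waiting branch is reported as a value rather than actually sleeping.

```rust
/// Simplified model of the adapter states named in the commit message.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Init,
    Running,
    Down,
    InitFailed,
    Remove,
}

/// What the remove path decides to do (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum RemoveOutcome {
    /// Shutdown already ran, so return immediately.
    AlreadyShutDown,
    /// The state is one that remove expects, so teardown can proceed.
    Proceed,
    /// Any other state: the driver would sleep and check again.
    WouldWait,
}

fn remove(state: State) -> RemoveOutcome {
    match state {
        // The early return: without this arm, Remove would fall into
        // the waiting case and never leave it.
        State::Remove => RemoveOutcome::AlreadyShutDown,
        State::Running | State::Down | State::InitFailed => RemoveOutcome::Proceed,
        _ => RemoveOutcome::WouldWait,
    }
}
```

Dropping the first match arm reproduces the hang in this model: once shutdown has set the state to Remove, nothing moves it to one of the expected states, so the wait never ends.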

Fixes: 9745780 ("iavf: Add waiting so the port is initialized in remove")
Fixes: a841733 ("iavf: Fix race condition between iavf_shutdown and iavf_remove")
Reported-by: Marius Cornea <mcornea@redhat.com>
Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
fbq pushed a commit that referenced this issue Sep 25, 2023
The following processes run into a deadlock. CPU 41 was waiting for CPU 29
to handle a CSD request while holding spinlock "crashdump_lock", but CPU 29
was hung by that spinlock with IRQs disabled.

  PID: 17360    TASK: ffff95c1090c5c40  CPU: 41  COMMAND: "mrdiagd"
  !# 0 [ffffb80edbf37b58] __read_once_size at ffffffff9b871a40 include/linux/compiler.h:185:0
  !# 1 [ffffb80edbf37b58] atomic_read at ffffffff9b871a40 arch/x86/include/asm/atomic.h:27:0
  !# 2 [ffffb80edbf37b58] dump_stack at ffffffff9b871a40 lib/dump_stack.c:54:0
   # 3 [ffffb80edbf37b78] csd_lock_wait_toolong at ffffffff9b131ad5 kernel/smp.c:364:0
   # 4 [ffffb80edbf37b78] __csd_lock_wait at ffffffff9b131ad5 kernel/smp.c:384:0
   # 5 [ffffb80edbf37bf8] csd_lock_wait at ffffffff9b13267a kernel/smp.c:394:0
   # 6 [ffffb80edbf37bf8] smp_call_function_many at ffffffff9b13267a kernel/smp.c:843:0
   # 7 [ffffb80edbf37c50] smp_call_function at ffffffff9b13279d kernel/smp.c:867:0
   # 8 [ffffb80edbf37c50] on_each_cpu at ffffffff9b13279d kernel/smp.c:976:0
   # 9 [ffffb80edbf37c78] flush_tlb_kernel_range at ffffffff9b085c4b arch/x86/mm/tlb.c:742:0
   #10 [ffffb80edbf37cb8] __purge_vmap_area_lazy at ffffffff9b23a1e0 mm/vmalloc.c:701:0
   #11 [ffffb80edbf37ce0] try_purge_vmap_area_lazy at ffffffff9b23a2cc mm/vmalloc.c:722:0
   #12 [ffffb80edbf37ce0] free_vmap_area_noflush at ffffffff9b23a2cc mm/vmalloc.c:754:0
   #13 [ffffb80edbf37cf8] free_unmap_vmap_area at ffffffff9b23bb3b mm/vmalloc.c:764:0
   #14 [ffffb80edbf37cf8] remove_vm_area at ffffffff9b23bb3b mm/vmalloc.c:1509:0
   #15 [ffffb80edbf37d18] __vunmap at ffffffff9b23bb8a mm/vmalloc.c:1537:0
   #16 [ffffb80edbf37d40] vfree at ffffffff9b23bc85 mm/vmalloc.c:1612:0
   #17 [ffffb80edbf37d58] megasas_free_host_crash_buffer [megaraid_sas] at ffffffffc020b7f2 drivers/scsi/megaraid/megaraid_sas_fusion.c:3932:0
   #18 [ffffb80edbf37d80] fw_crash_state_store [megaraid_sas] at ffffffffc01f804d drivers/scsi/megaraid/megaraid_sas_base.c:3291:0
   #19 [ffffb80edbf37dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0
   #20 [ffffb80edbf37dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0
   #21 [ffffb80edbf37de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0
   #22 [ffffb80edbf37e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0
   #23 [ffffb80edbf37ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0
   #24 [ffffb80edbf37ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0
   #25 [ffffb80edbf37ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0
   #26 [ffffb80edbf37f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0
   #27 [ffffb80edbf37f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0

  PID: 17355    TASK: ffff95c1090c3d80  CPU: 29  COMMAND: "mrdiagd"
  !# 0 [ffffb80f2d3c7d30] __read_once_size at ffffffff9b0f2ab0 include/linux/compiler.h:185:0
  !# 1 [ffffb80f2d3c7d30] native_queued_spin_lock_slowpath at ffffffff9b0f2ab0 kernel/locking/qspinlock.c:368:0
   # 2 [ffffb80f2d3c7d58] pv_queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/paravirt.h:674:0
   # 3 [ffffb80f2d3c7d58] queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/qspinlock.h:53:0
   # 4 [ffffb80f2d3c7d68] queued_spin_lock at ffffffff9b8961a6 include/asm-generic/qspinlock.h:90:0
   # 5 [ffffb80f2d3c7d68] do_raw_spin_lock_flags at ffffffff9b8961a6 include/linux/spinlock.h:173:0
   # 6 [ffffb80f2d3c7d68] __raw_spin_lock_irqsave at ffffffff9b8961a6 include/linux/spinlock_api_smp.h:122:0
   # 7 [ffffb80f2d3c7d68] _raw_spin_lock_irqsave at ffffffff9b8961a6 kernel/locking/spinlock.c:160:0
   # 8 [ffffb80f2d3c7d88] fw_crash_buffer_store [megaraid_sas] at ffffffffc01f8129 drivers/scsi/megaraid/megaraid_sas_base.c:3205:0
   # 9 [ffffb80f2d3c7dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0
   #10 [ffffb80f2d3c7dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0
   #11 [ffffb80f2d3c7de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0
   #12 [ffffb80f2d3c7e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0
   #13 [ffffb80f2d3c7ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0
   #14 [ffffb80f2d3c7ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0
   #15 [ffffb80f2d3c7ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0
   #16 [ffffb80f2d3c7f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0
   #17 [ffffb80f2d3c7f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0

The lock is used to synchronize different sysfs operations, it doesn't
protect any resource that will be touched by an interrupt. Consequently
it's not required to disable IRQs. Replace the spinlock with a mutex to fix
the deadlock.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20230828221018.19471-1-junxiao.bi@oracle.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
gurugio pushed a commit to gurugio/rust-for-linux that referenced this issue Oct 2, 2023
At least it does not panic.
I compared the address of proc_open with the value of proc_ops->proc_open.
They are the same, so I guess the function pointer is set correctly.

I also added messages to check whether the open function crashes.
        for _ in 0..10000 {
            pr_err!("proc_open is invoked\n");
        }
Then I found out that it is the read that crashes, as shown below.

/ # insmod share/rust_proc.ko
[    6.944654] rust_proc: module verification failed: signature and/or required key missing - tainting kernel
[    6.946329] rust_proc: rust_proc is loaded
[    6.946981] proc_create_data: rust_proc_fs proc_open=ffffffffc0201040
[    6.947959] rust_proc: succeeded to create a proc entry: 0xffff888005469780 proc_open=0xffffffffc0201040
/ # cat /proc/rust_demo/rust_proc_fs
.........
.........
[   15.546497] rust_proc: proc_open is invoked
[   15.546836] rust_proc: proc_open is invoked
[   15.547176] rust_proc: proc_open is invoked
[   15.547530] rust_proc: proc_open is invoked
[   15.547866] rust_proc: proc_open is invoked
[   15.548204] rust_proc: proc_open is invoked
[   15.548544] rust_proc: proc_open is invoked
[   15.549052] BUG: kernel NULL pointer dereference, address: 0000000000000001
[   15.549617] #PF: supervisor instruction fetch in kernel mode
[   15.549801] #PF: error_code(0x0010) - not-present page
[   15.549801] PGD 561e067 P4D 561e067 PUD 561c067 PMD 0
[   15.549801] Oops: 0010 [#1] PREEMPT SMP NOPTI
[   15.549801] CPU: 0 PID: 120 Comm: cat Tainted: G            E      6.3.0+ #22
[   15.549801] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   15.549801] RIP: 0010:0x1
[   15.549801] Code: Unable to access opcode bytes at 0xffffffffffffffd7.
[   15.549801] RSP: 0018:ffff8880056b3e00 EFLAGS: 00010202
[   15.549801] RAX: ffff888005733898 RBX: 0000000000000000 RCX: ffff8880056b3ef0
[   15.549801] RDX: 0000000000001000 RSI: 00007ffc1f99b0a8 RDI: ffff888005729600
[   15.549801] RBP: ffff8880056b3e48 R08: 00007ffc1f99b0a8 R09: 0000000000000000
[   15.549801] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888005469f00
[   15.549801] R13: ffff888005729600 R14: 0000000000000001 R15: 0000000000000000
[   15.549801] FS:  0000000001e153c0(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000
[   15.549801] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.549801] CR2: ffffffffffffffd7 CR3: 0000000003c60000 CR4: 00000000000006f0
[   15.549801] Call Trace:
[   15.549801]  <TASK>
[   15.549801]  ? proc_reg_read+0xe8/0x150
[   15.549801]  vfs_read+0xb4/0x260
[   15.549801]  ? do_sendfile+0x1cf/0x3f0
[   15.549801]  ksys_read+0x5f/0xb0
[   15.549801]  __x64_sys_read+0x1b/0x20
[   15.549801]  do_syscall_64+0x35/0x50
[   15.549801]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   15.549801] RIP: 0033:0x4ad272
[   15.549801] Code: 31 c0 e9 b1 fe ff ff 50 48 8d 3d c1 80 17 00 e8 54 8e 00 00 0f 1f 40 00 f3 0f 1e fa 64 8b 04 25 18 00 00 04
[   15.549801] RSP: 002b:00007ffc1f99b048 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   15.549801] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004ad272
[   15.549801] RDX: 0000000000001000 RSI: 00007ffc1f99b0a8 RDI: 0000000000000003
[   15.549801] RBP: 00007ffc1f99b0a8 R08: 0000000000000001 R09: 0000000000000000
[   15.549801] R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000
[   15.549801] R13: 0000000001e153a0 R14: 0000000000000000 R15: 0000000000000001
[   15.549801]  </TASK>
[   15.549801] Modules linked in: rust_proc(E)
[   15.549801] CR2: 0000000000000001
[   15.549801] ---[ end trace 0000000000000000 ]---
[   15.549801] RIP: 0010:0x1
[   15.549801] Code: Unable to access opcode bytes at 0xffffffffffffffd7.
[   15.549801] RSP: 0018:ffff8880056b3e00 EFLAGS: 00010202
[   15.549801] RAX: ffff888005733898 RBX: 0000000000000000 RCX: ffff8880056b3ef0
[   15.549801] RDX: 0000000000001000 RSI: 00007ffc1f99b0a8 RDI: ffff888005729600
[   15.549801] RBP: ffff8880056b3e48 R08: 00007ffc1f99b0a8 R09: 0000000000000000
[   15.549801] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888005469f00
[   15.549801] R13: ffff888005729600 R14: 0000000000000001 R15: 0000000000000000
[   15.549801] FS:  0000000001e153c0(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000
[   15.549801] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.549801] CR2: ffffffffffffffd7 CR3: 0000000003c60000 CR4: 00000000000006f0
[   15.549801] note: cat[120] exited with irqs disabled
[   15.572950] BUG: kernel NULL pointer dereference, address: 0000000000000001
[   15.573491] #PF: supervisor instruction fetch in kernel mode
[   15.573932] #PF: error_code(0x0010) - not-present page
[   15.574335] PGD 0 P4D 0
[   15.574535] Oops: 0010 [#2] PREEMPT SMP NOPTI
[   15.574892] CPU: 0 PID: 120 Comm: cat Tainted: G      D     E      6.3.0+ Rust-for-Linux#22
[   15.575462] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   15.576107] RIP: 0010:0x1
[   15.576328] Code: Unable to access opcode bytes at 0xffffffffffffffd7.
[   15.576842] RSP: 0018:ffff8880056b3de8 EFLAGS: 00010246
[   15.576842] RAX: ffff888005733898 RBX: 0000000000000000 RCX: 0000000000000001
[   15.576842] RDX: ffff8880054ec800 RSI: ffff888005729600 RDI: ffff888004f7ce08
[   15.576842] RBP: ffff8880056b3e30 R08: ffff888003c43c00 R09: ffff888004f7ce08
[   15.576842] R10: ffffea00000f93c0 R11: 0000000000000001 R12: ffff8880056bae10
[   15.576842] R13: 00000000000a800d R14: ffff888005469f00 R15: ffff888005469f18
[   15.576842] FS:  0000000000000000(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000
[   15.576842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.576842] CR2: ffffffffffffffd7 CR3: 0000000002434000 CR4: 00000000000006f0
[   15.576842] Call Trace:
[   15.576842]  <TASK>
[   15.576842]  ? close_pdeo+0x59/0x120
[   15.576842]  proc_reg_release+0x6f/0x80
[   15.576842]  __fput+0xf0/0x220
[   15.576842]  ____fput+0xe/0x10
[   15.576842]  task_work_run+0xc3/0xe0
[   15.576842]  do_exit+0x3e2/0xab0
[   15.576842]  make_task_dead+0x83/0x130
[   15.576842]  rewind_stack_and_make_dead+0x17/0x20
[   15.576842] RIP: 0033:0x4ad272
[   15.576842] Code: Unable to access opcode bytes at 0x4ad248.
[   15.576842] RSP: 002b:00007ffc1f99b048 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   15.576842] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004ad272
[   15.576842] RDX: 0000000000001000 RSI: 00007ffc1f99b0a8 RDI: 0000000000000003
[   15.576842] RBP: 00007ffc1f99b0a8 R08: 0000000000000001 R09: 0000000000000000
[   15.576842] R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000
[   15.576842] R13: 0000000001e153a0 R14: 0000000000000000 R15: 0000000000000001
[   15.576842]  </TASK>
[   15.576842] Modules linked in: rust_proc(E)
[   15.576842] CR2: 0000000000000001
[   15.576842] ---[ end trace 0000000000000000 ]---
[   15.576842] RIP: 0010:0x1
[   15.576842] Code: Unable to access opcode bytes at 0xffffffffffffffd7.
[   15.576842] RSP: 0018:ffff8880056b3e00 EFLAGS: 00010202
[   15.576842] RAX: ffff888005733898 RBX: 0000000000000000 RCX: ffff8880056b3ef0
[   15.576842] RDX: 0000000000001000 RSI: 00007ffc1f99b0a8 RDI: ffff888005729600
[   15.576842] RBP: ffff8880056b3e48 R08: 00007ffc1f99b0a8 R09: 0000000000000000
[   15.576842] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888005469f00
[   15.576842] R13: ffff888005729600 R14: 0000000000000001 R15: 0000000000000000
[   15.576842] FS:  0000000000000000(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000
[   15.576842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.576842] CR2: ffffffffffffffd7 CR3: 0000000002434000 CR4: 00000000000006f0
[   15.576842] note: cat[120] exited with irqs disabled
[   15.595880] Fixing recursive fault but reboot is needed!
[   15.596287] BUG: scheduling while atomic: cat/120/0x00000000
[   15.596720] Modules linked in: rust_proc(E)
[   15.597039] CPU: 0 PID: 120 Comm: cat Tainted: G      D     E      6.3.0+ Rust-for-Linux#22
[   15.597587] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   15.598226] Call Trace:
[   15.598420]  <TASK>
[   15.598592]  dump_stack_lvl+0x58/0x70
[   15.598883]  dump_stack+0x10/0x20
[   15.599141]  __schedule_bug+0x62/0x70
[   15.599449]  __schedule+0x838/0x1450
[   15.599731]  ? vprintk_default+0x1d/0x20
[   15.599849]  ? vprintk+0x60/0x80
[   15.599849]  ? _printk+0x4b/0x50
[   15.599849]  do_task_dead+0x41/0x50
[   15.599849]  make_task_dead+0x129/0x130
[   15.599849]  rewind_stack_and_make_dead+0x17/0x20
[   15.599849] RIP: 0033:0x4ad272
[   15.599849] Code: Unable to access opcode bytes at 0x4ad248.
[   15.599849] RSP: 002b:00007ffc1f99b048 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   15.599849] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004ad272
[   15.599849] RDX: 0000000000001000 RSI: 00007ffc1f99b0a8 RDI: 0000000000000003
[   15.599849] RBP: 00007ffc1f99b0a8 R08: 0000000000000001 R09: 0000000000000000
[   15.599849] R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000
[   15.599849] R13: 0000000001e153a0 R14: 0000000000000000 R15: 0000000000000001
[   15.599849]  </TASK>

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 8379593fa4bb..bea879760ebc 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -573,7 +573,8 @@ struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
        p = proc_create_reg(name, mode, &parent, data);
        if (!p)
                return NULL;
-       p->proc_ops = proc_ops;
+       printk(KERN_ERR "proc_create_data: %s proc_open=%px\n", name, proc_ops->proc_open);
+       p->proc_ops = proc_ops;
        pde_set_flags(p);
        return proc_register(parent, p);
 }
gurugio pushed a commit to gurugio/rust-for-linux that referenced this issue Oct 8, 2023
At least it does not panic.
I compared the address of proc_open with the value of proc_ops->proc_open.
They are the same, so I guess the function pointer is set correctly.

I also added messages to check whether the open function crashes.
        for _ in 0..10000 {
            pr_err!("proc_open is invoked\n");
        }
Then I found out that the read causes the crash shown below.

/ # insmod share/rust_proc.ko
[    6.944654] rust_proc: module verification failed: signature and/or required key missing - tainting kernel
[    6.946329] rust_proc: rust_proc is loaded
[    6.946981] proc_create_data: rust_proc_fs proc_open=ffffffffc0201040
[    6.947959] rust_proc: succeeded to create a proc entry: 0xffff888005469780 proc_open=0xffffffffc0201040
/ # cat /proc/rust_demo/rust_proc_fs
.........
.........
[   15.546497] rust_proc: proc_open is invoked
[   15.546836] rust_proc: proc_open is invoked
[   15.547176] rust_proc: proc_open is invoked
[   15.547530] rust_proc: proc_open is invoked
[   15.547866] rust_proc: proc_open is invoked
[   15.548204] rust_proc: proc_open is invoked
[   15.548544] rust_proc: proc_open is invoked
[   15.549052] BUG: kernel NULL pointer dereference, address: 0000000000000001
[   15.549617] #PF: supervisor instruction fetch in kernel mode
[   15.549801] #PF: error_code(0x0010) - not-present page
[   15.549801] PGD 561e067 P4D 561e067 PUD 561c067 PMD 0
[   15.549801] Oops: 0010 [#1] PREEMPT SMP NOPTI
[   15.549801] CPU: 0 PID: 120 Comm: cat Tainted: G            E      6.3.0+ Rust-for-Linux#22
[   15.549801] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   15.549801] RIP: 0010:0x1
[   15.549801] Code: Unable to access opcode bytes at 0xffffffffffffffd7.
[   15.549801] RSP: 0018:ffff8880056b3e00 EFLAGS: 00010202
[   15.549801] RAX: ffff888005733898 RBX: 0000000000000000 RCX: ffff8880056b3ef0
[   15.549801] RDX: 0000000000001000 RSI: 00007ffc1f99b0a8 RDI: ffff888005729600
[   15.549801] RBP: ffff8880056b3e48 R08: 00007ffc1f99b0a8 R09: 0000000000000000
[   15.549801] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888005469f00
[   15.549801] R13: ffff888005729600 R14: 0000000000000001 R15: 0000000000000000
[   15.549801] FS:  0000000001e153c0(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000
[   15.549801] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.549801] CR2: ffffffffffffffd7 CR3: 0000000003c60000 CR4: 00000000000006f0
[   15.549801] Call Trace:
[   15.549801]  <TASK>
[   15.549801]  ? proc_reg_read+0xe8/0x150
[   15.549801]  vfs_read+0xb4/0x260
[   15.549801]  ? do_sendfile+0x1cf/0x3f0
[   15.549801]  ksys_read+0x5f/0xb0
[   15.549801]  __x64_sys_read+0x1b/0x20
[   15.549801]  do_syscall_64+0x35/0x50
[   15.549801]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   15.549801] RIP: 0033:0x4ad272
[   15.549801] Code: 31 c0 e9 b1 fe ff ff 50 48 8d 3d c1 80 17 00 e8 54 8e 00 00 0f 1f 40 00 f3 0f 1e fa 64 8b 04 25 18 00 00 04
[   15.549801] RSP: 002b:00007ffc1f99b048 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   15.549801] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004ad272
[   15.549801] RDX: 0000000000001000 RSI: 00007ffc1f99b0a8 RDI: 0000000000000003
[   15.549801] RBP: 00007ffc1f99b0a8 R08: 0000000000000001 R09: 0000000000000000
[   15.549801] R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000
[   15.549801] R13: 0000000001e153a0 R14: 0000000000000000 R15: 0000000000000001
[   15.549801]  </TASK>
[   15.549801] Modules linked in: rust_proc(E)
[   15.549801] CR2: 0000000000000001
[   15.549801] ---[ end trace 0000000000000000 ]---
[   15.549801] RIP: 0010:0x1
[   15.549801] Code: Unable to access opcode bytes at 0xffffffffffffffd7.
[   15.549801] RSP: 0018:ffff8880056b3e00 EFLAGS: 00010202
[   15.549801] RAX: ffff888005733898 RBX: 0000000000000000 RCX: ffff8880056b3ef0
[   15.549801] RDX: 0000000000001000 RSI: 00007ffc1f99b0a8 RDI: ffff888005729600
[   15.549801] RBP: ffff8880056b3e48 R08: 00007ffc1f99b0a8 R09: 0000000000000000
[   15.549801] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888005469f00
[   15.549801] R13: ffff888005729600 R14: 0000000000000001 R15: 0000000000000000
[   15.549801] FS:  0000000001e153c0(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000
[   15.549801] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.549801] CR2: ffffffffffffffd7 CR3: 0000000003c60000 CR4: 00000000000006f0
[   15.549801] note: cat[120] exited with irqs disabled
[   15.572950] BUG: kernel NULL pointer dereference, address: 0000000000000001
[   15.573491] #PF: supervisor instruction fetch in kernel mode
[   15.573932] #PF: error_code(0x0010) - not-present page
[   15.574335] PGD 0 P4D 0
[   15.574535] Oops: 0010 [#2] PREEMPT SMP NOPTI
[   15.574892] CPU: 0 PID: 120 Comm: cat Tainted: G      D     E      6.3.0+ Rust-for-Linux#22
[   15.575462] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   15.576107] RIP: 0010:0x1
[   15.576328] Code: Unable to access opcode bytes at 0xffffffffffffffd7.
[   15.576842] RSP: 0018:ffff8880056b3de8 EFLAGS: 00010246
[   15.576842] RAX: ffff888005733898 RBX: 0000000000000000 RCX: 0000000000000001
[   15.576842] RDX: ffff8880054ec800 RSI: ffff888005729600 RDI: ffff888004f7ce08
[   15.576842] RBP: ffff8880056b3e30 R08: ffff888003c43c00 R09: ffff888004f7ce08
[   15.576842] R10: ffffea00000f93c0 R11: 0000000000000001 R12: ffff8880056bae10
[   15.576842] R13: 00000000000a800d R14: ffff888005469f00 R15: ffff888005469f18
[   15.576842] FS:  0000000000000000(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000
[   15.576842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.576842] CR2: ffffffffffffffd7 CR3: 0000000002434000 CR4: 00000000000006f0
[   15.576842] Call Trace:
[   15.576842]  <TASK>
[   15.576842]  ? close_pdeo+0x59/0x120
[   15.576842]  proc_reg_release+0x6f/0x80
[   15.576842]  __fput+0xf0/0x220
[   15.576842]  ____fput+0xe/0x10
[   15.576842]  task_work_run+0xc3/0xe0
[   15.576842]  do_exit+0x3e2/0xab0
[   15.576842]  make_task_dead+0x83/0x130
[   15.576842]  rewind_stack_and_make_dead+0x17/0x20
[   15.576842] RIP: 0033:0x4ad272
[   15.576842] Code: Unable to access opcode bytes at 0x4ad248.
[   15.576842] RSP: 002b:00007ffc1f99b048 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   15.576842] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004ad272
[   15.576842] RDX: 0000000000001000 RSI: 00007ffc1f99b0a8 RDI: 0000000000000003
[   15.576842] RBP: 00007ffc1f99b0a8 R08: 0000000000000001 R09: 0000000000000000
[   15.576842] R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000
[   15.576842] R13: 0000000001e153a0 R14: 0000000000000000 R15: 0000000000000001
[   15.576842]  </TASK>
[   15.576842] Modules linked in: rust_proc(E)
[   15.576842] CR2: 0000000000000001
[   15.576842] ---[ end trace 0000000000000000 ]---
[   15.576842] RIP: 0010:0x1
[   15.576842] Code: Unable to access opcode bytes at 0xffffffffffffffd7.
[   15.576842] RSP: 0018:ffff8880056b3e00 EFLAGS: 00010202
[   15.576842] RAX: ffff888005733898 RBX: 0000000000000000 RCX: ffff8880056b3ef0
[   15.576842] RDX: 0000000000001000 RSI: 00007ffc1f99b0a8 RDI: ffff888005729600
[   15.576842] RBP: ffff8880056b3e48 R08: 00007ffc1f99b0a8 R09: 0000000000000000
[   15.576842] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888005469f00
[   15.576842] R13: ffff888005729600 R14: 0000000000000001 R15: 0000000000000000
[   15.576842] FS:  0000000000000000(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000
[   15.576842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.576842] CR2: ffffffffffffffd7 CR3: 0000000002434000 CR4: 00000000000006f0
[   15.576842] note: cat[120] exited with irqs disabled
[   15.595880] Fixing recursive fault but reboot is needed!
[   15.596287] BUG: scheduling while atomic: cat/120/0x00000000
[   15.596720] Modules linked in: rust_proc(E)
[   15.597039] CPU: 0 PID: 120 Comm: cat Tainted: G      D     E      6.3.0+ Rust-for-Linux#22
[   15.597587] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   15.598226] Call Trace:
[   15.598420]  <TASK>
[   15.598592]  dump_stack_lvl+0x58/0x70
[   15.598883]  dump_stack+0x10/0x20
[   15.599141]  __schedule_bug+0x62/0x70
[   15.599449]  __schedule+0x838/0x1450
[   15.599731]  ? vprintk_default+0x1d/0x20
[   15.599849]  ? vprintk+0x60/0x80
[   15.599849]  ? _printk+0x4b/0x50
[   15.599849]  do_task_dead+0x41/0x50
[   15.599849]  make_task_dead+0x129/0x130
[   15.599849]  rewind_stack_and_make_dead+0x17/0x20
[   15.599849] RIP: 0033:0x4ad272
[   15.599849] Code: Unable to access opcode bytes at 0x4ad248.
[   15.599849] RSP: 002b:00007ffc1f99b048 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   15.599849] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004ad272
[   15.599849] RDX: 0000000000001000 RSI: 00007ffc1f99b0a8 RDI: 0000000000000003
[   15.599849] RBP: 00007ffc1f99b0a8 R08: 0000000000000001 R09: 0000000000000000
[   15.599849] R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000
[   15.599849] R13: 0000000001e153a0 R14: 0000000000000000 R15: 0000000000000001
[   15.599849]  </TASK>

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 8379593fa4bb..bea879760ebc 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -573,7 +573,8 @@ struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
        p = proc_create_reg(name, mode, &parent, data);
        if (!p)
                return NULL;
-       p->proc_ops = proc_ops;
+       printk(KERN_ERR "proc_create_data: %s proc_open=%px\n", name, proc_ops->proc_open);
+       p->proc_ops = proc_ops;
        pde_set_flags(p);
        return proc_register(parent, p);
 }
gurugio pushed a commit to gurugio/rust-for-linux that referenced this issue Oct 14, 2023
metaspace pushed a commit that referenced this issue Feb 7, 2024
We call bnxt_half_open_nic() to set up the chip partially to run
loopback tests.  The rings and buffers are initialized normally
so that we can transmit and receive packets in loopback mode.
That means page pool buffers are allocated for the aggregation ring
just like the normal case.  NAPI is not needed because we are just
polling for the loopback packets.

When we're done with the loopback tests, we call bnxt_half_close_nic()
to clean up.  When freeing the page pools, we hit a WARN_ON()
in page_pool_unlink_napi() because the NAPI state linked to the
page pool is uninitialized.

The simplest way to avoid this warning is just to initialize the
NAPIs during half open and delete the NAPIs during half close.
Trying to skip the page pool initialization or skip linking of
NAPI during half open will be more complicated.
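
The pairing described above can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: `napi_model`, `pool_model`, `nic_model` and the three functions are hypothetical stand-ins, not the bnxt_en or page pool API, and the warning condition is reduced to "the linked NAPI was never initialized".

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the real bnxt_en or page pool types. */
struct napi_model {
	bool initialized;
};

struct pool_model {
	struct napi_model *napi;	/* NAPI instance linked to the pool */
};

struct nic_model {
	struct napi_model napi;
	struct pool_model pool;
	int warnings;			/* counts modelled WARN_ON() hits */
};

/* Modelled check: unlinking from a never-initialized NAPI is a warning. */
void pool_destroy(struct nic_model *nic)
{
	if (nic->pool.napi && !nic->pool.napi->initialized)
		nic->warnings++;
	nic->pool.napi = NULL;
}

/* Half open: the pool is always linked; init_napi models the fix. */
void half_open(struct nic_model *nic, bool init_napi)
{
	nic->napi.initialized = init_napi;
	nic->pool.napi = &nic->napi;
}

/*
 * Half close: destroy the pool and drop the NAPI. The teardown order
 * here is a simplification of this toy model, not the driver's.
 */
void half_close(struct nic_model *nic)
{
	pool_destroy(nic);
	nic->napi.initialized = false;
}
```

In this model, `half_open(&nic, false)` followed by `half_close(&nic)` increments `warnings`, while passing `true` does not.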

This fix avoids this warning:

WARNING: CPU: 4 PID: 46967 at net/core/page_pool.c:946 page_pool_unlink_napi+0x1f/0x30
CPU: 4 PID: 46967 Comm: ethtool Tainted: G S      W          6.7.0-rc5+ #22
Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.3.8 08/31/2021
RIP: 0010:page_pool_unlink_napi+0x1f/0x30
Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 8b 47 18 48 85 c0 74 1b 48 8b 50 10 83 e2 01 74 08 8b 40 34 83 f8 ff 74 02 <0f> 0b 48 c7 47 18 00 00 00 00 c3 cc cc cc cc 66 90 90 90 90 90 90
RSP: 0018:ffa000003d0dfbe8 EFLAGS: 00010246
RAX: ff110003607ce640 RBX: ff110010baf5d000 RCX: 0000000000000008
RDX: 0000000000000000 RSI: ff110001e5e522c0 RDI: ff110010baf5d000
RBP: ff11000145539b40 R08: 0000000000000001 R09: ffffffffc063f641
R10: ff110001361eddb8 R11: 000000000040000f R12: 0000000000000001
R13: 000000000000001c R14: ff1100014553a080 R15: 0000000000003fc0
FS:  00007f9301c4f740(0000) GS:ff1100103fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f91344fa8f0 CR3: 00000003527cc005 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn+0x81/0x140
 ? page_pool_unlink_napi+0x1f/0x30
 ? report_bug+0x102/0x200
 ? handle_bug+0x44/0x70
 ? exc_invalid_op+0x13/0x60
 ? asm_exc_invalid_op+0x16/0x20
 ? bnxt_free_ring.isra.123+0xb1/0xd0 [bnxt_en]
 ? page_pool_unlink_napi+0x1f/0x30
 page_pool_destroy+0x3e/0x150
 bnxt_free_mem+0x441/0x5e0 [bnxt_en]
 bnxt_half_close_nic+0x2a/0x40 [bnxt_en]
 bnxt_self_test+0x21d/0x450 [bnxt_en]
 __dev_ethtool+0xeda/0x2e30
 ? native_queued_spin_lock_slowpath+0x17f/0x2b0
 ? __link_object+0xa1/0x160
 ? _raw_spin_unlock_irqrestore+0x23/0x40
 ? __create_object+0x5f/0x90
 ? __kmem_cache_alloc_node+0x317/0x3c0
 ? dev_ethtool+0x59/0x170
 dev_ethtool+0xa7/0x170
 dev_ioctl+0xc3/0x530
 sock_do_ioctl+0xa8/0xf0
 sock_ioctl+0x270/0x310
 __x64_sys_ioctl+0x8c/0xc0
 do_syscall_64+0x3e/0xf0
 entry_SYSCALL_64_after_hwframe+0x6e/0x76

Fixes: 294e39e ("bnxt: hook NAPIs to page pools")
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240117234515.226944-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ojeda pushed a commit that referenced this issue Feb 16, 2024
When configuring a hugetlb filesystem via the fsconfig() syscall, there is
a possible NULL dereference in hugetlbfs_fill_super() caused by assigning
NULL to ctx->hstate in hugetlbfs_parse_param() when the requested pagesize
is not valid.

For example, take the following steps:

     fd = fsopen("hugetlbfs", FSOPEN_CLOEXEC);
     fsconfig(fd, FSCONFIG_SET_STRING, "pagesize", "1024", 0);
     fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

Given that the requested "pagesize" is invalid, ctx->hstate will be replaced
with NULL, losing its previous value, and we will print an error:

 ...
 ...
 case Opt_pagesize:
 ps = memparse(param->string, &rest);
 ctx->hstate = h;
 if (!ctx->hstate) {
         pr_err("Unsupported page size %lu MB\n", ps / SZ_1M);
         return -EINVAL;
 }
 return 0;
 ...
 ...

This is a problem because later on, we will dereference ctx->hstate in
hugetlbfs_fill_super():

 ...
 ...
 sb->s_blocksize = huge_page_size(ctx->hstate);
 ...
 ...

This causes the Oops below.

Fix this by replacing the ctx->hstate value only when the pagesize is known
to be valid.
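
The validate-before-assign pattern described here can be sketched as a small userspace model. It is not the kernel patch itself: `hstate_model`, `fs_context_model`, `lookup_hstate()`, `set_pagesize()` and `block_size()` are hypothetical names, and the two supported sizes are assumed for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's hstate and fs context. */
struct hstate_model {
	unsigned long page_size;	/* bytes */
};

struct fs_context_model {
	const struct hstate_model *hstate;	/* must never become NULL */
};

/* Assumed set of supported huge page sizes for this model. */
static const struct hstate_model supported[] = {
	{ 2UL << 20 },	/* 2 MiB */
	{ 1UL << 30 },	/* 1 GiB */
};

/* Return the matching entry, or NULL if the size is unsupported. */
const struct hstate_model *lookup_hstate(unsigned long size)
{
	for (size_t i = 0; i < sizeof(supported) / sizeof(supported[0]); i++)
		if (supported[i].page_size == size)
			return &supported[i];
	return NULL;
}

/* Look up into a local first; touch ctx->hstate only on success. */
int set_pagesize(struct fs_context_model *ctx, unsigned long size)
{
	const struct hstate_model *h = lookup_hstate(size);

	if (!h)
		return -EINVAL;	/* ctx->hstate keeps its previous value */
	ctx->hstate = h;
	return 0;
}

/* Models the later dereference; relies on hstate being non-NULL. */
unsigned long block_size(const struct fs_context_model *ctx)
{
	assert(ctx->hstate != NULL);
	return ctx->hstate->page_size;
}
```

With this shape, a rejected size such as 1024 bytes returns `-EINVAL` and leaves the previously selected entry in place, so the later `block_size()` call still has a valid pointer to dereference.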

 kernel: hugetlbfs: Unsupported page size 0 MB
 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028
 kernel: #PF: supervisor read access in kernel mode
 kernel: #PF: error_code(0x0000) - not-present page
 kernel: PGD 800000010f66c067 P4D 800000010f66c067 PUD 1b22f8067 PMD 0
 kernel: Oops: 0000 [#1] PREEMPT SMP PTI
 kernel: CPU: 4 PID: 5659 Comm: syscall Tainted: G            E      6.8.0-rc2-default+ #22 5a47c3fef76212addcc6eb71344aabc35190ae8f
 kernel: Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
 kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
 kernel: FS:  00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0
 kernel: Call Trace:
 kernel:  <TASK>
 kernel:  ? __die_body+0x1a/0x60
 kernel:  ? page_fault_oops+0x16f/0x4a0
 kernel:  ? search_bpf_extables+0x65/0x70
 kernel:  ? fixup_exception+0x22/0x310
 kernel:  ? exc_page_fault+0x69/0x150
 kernel:  ? asm_exc_page_fault+0x22/0x30
 kernel:  ? __pfx_hugetlbfs_fill_super+0x10/0x10
 kernel:  ? hugetlbfs_fill_super+0xb4/0x1a0
 kernel:  ? hugetlbfs_fill_super+0x28/0x1a0
 kernel:  ? __pfx_hugetlbfs_fill_super+0x10/0x10
 kernel:  vfs_get_super+0x40/0xa0
 kernel:  ? __pfx_bpf_lsm_capable+0x10/0x10
 kernel:  vfs_get_tree+0x25/0xd0
 kernel:  vfs_cmd_create+0x64/0xe0
 kernel:  __x64_sys_fsconfig+0x395/0x410
 kernel:  do_syscall_64+0x80/0x160
 kernel:  ? syscall_exit_to_user_mode+0x82/0x240
 kernel:  ? do_syscall_64+0x8d/0x160
 kernel:  ? syscall_exit_to_user_mode+0x82/0x240
 kernel:  ? do_syscall_64+0x8d/0x160
 kernel:  ? exc_page_fault+0x69/0x150
 kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0x76
 kernel: RIP: 0033:0x7ffbc0cb87c9
 kernel: Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 96 0d 00 f7 d8 64 89 01 48
 kernel: RSP: 002b:00007ffc29d2f388 EFLAGS: 00000206 ORIG_RAX: 00000000000001af
 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffbc0cb87c9
 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
 kernel: RBP: 00007ffc29d2f3b0 R08: 0000000000000000 R09: 0000000000000000
 kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
 kernel: R13: 00007ffc29d2f4c0 R14: 0000000000000000 R15: 0000000000000000
 kernel:  </TASK>
 kernel: Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) netfs(E) af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) intel_rapl_msr(E) intel_rapl_common(E) iTCO_wdt(E) intel_pmc_bxt(E) sb_edac(E) iTCO_vendor_support(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) rfkill(E) ipmi_ssif(E) kvm(E) acpi_ipmi(E) irqbypass(E) pcspkr(E) igb(E) ipmi_si(E) mei_me(E) i2c_i801(E) joydev(E) intel_pch_thermal(E) i2c_smbus(E) dca(E) lpc_ich(E) mei(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) tiny_power_button(E) button(E) fuse(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) ext4(E) mbcache(E) jbd2(E) hid_generic(E) usbhid(E) sd_mod(E) t10_pi(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) polyval_clmulni(E) ahci(E) xhci_pci(E) polyval_generic(E) gf128mul(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) xhci_pci_renesas(E) libahci(E) ehci_pci(E) sha1_ssse3(E) xhci_hcd(E) ehci_hcd(E) libata(E)
 kernel:  mgag200(E) i2c_algo_bit(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) scsi_common(E) aesni_intel(E) crypto_simd(E) cryptd(E)
 kernel: Unloaded tainted modules: acpi_cpufreq(E):1 fjes(E):1
 kernel: CR2: 0000000000000028
 kernel: ---[ end trace 0000000000000000 ]---
 kernel: RIP: 0010:hugetlbfs_fill_super+0xb4/0x1a0
 kernel: Code: 48 8b 3b e8 3e c6 ed ff 48 85 c0 48 89 45 20 0f 84 d6 00 00 00 48 b8 ff ff ff ff ff ff ff 7f 4c 89 e7 49 89 44 24 20 48 8b 03 <8b> 48 28 b8 00 10 00 00 48 d3 e0 49 89 44 24 18 48 8b 03 8b 40 28
 kernel: RSP: 0018:ffffbe9960fcbd48 EFLAGS: 00010246
 kernel: RAX: 0000000000000000 RBX: ffff9af5272ae780 RCX: 0000000000372004
 kernel: RDX: ffffffffffffffff RSI: ffffffffffffffff RDI: ffff9af555e9b000
 kernel: RBP: ffff9af52ee66b00 R08: 0000000000000040 R09: 0000000000370004
 kernel: R10: ffffbe9960fcbd48 R11: 0000000000000040 R12: ffff9af555e9b000
 kernel: R13: ffffffffa66b86c0 R14: ffff9af507d2f400 R15: ffff9af507d2f400
 kernel: FS:  00007ffbc0ba4740(0000) GS:ffff9b0bd7000000(0000) knlGS:0000000000000000
 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 kernel: CR2: 0000000000000028 CR3: 00000001b1ee0000 CR4: 00000000001506f0

Link: https://lkml.kernel.org/r/20240130210418.3771-1-osalvador@suse.de
Fixes: 3202198 ("hugetlbfs: Convert to fs_context")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ojeda pushed a commit that referenced this issue Mar 25, 2024
When the skb is reorganized during esp_output (!esp->inline), the pages
coming from the original skb fragments are supposed to be released back
to the system through put_page. But if the skb fragment pages are
originating from a page_pool, calling put_page on them will trigger a
page_pool leak which will eventually result in a crash.

This leak can be easily observed when using CONFIG_DEBUG_VM and doing
ipsec + gre (non offloaded) forwarding:

  BUG: Bad page state in process ksoftirqd/16  pfn:1451b6
  page:00000000de2b8d32 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1451b6000 pfn:0x1451b6
  flags: 0x200000000000000(node=0|zone=2)
  page_type: 0xffffffff()
  raw: 0200000000000000 dead000000000040 ffff88810d23c000 0000000000000000
  raw: 00000001451b6000 0000000000000001 00000000ffffffff 0000000000000000
  page dumped because: page_pool leak
  Modules linked in: ip_gre gre mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay zram zsmalloc fuse [last unloaded: mlx5_core]
  CPU: 16 PID: 96 Comm: ksoftirqd/16 Not tainted 6.8.0-rc4+ #22
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x36/0x50
   bad_page+0x70/0xf0
   free_unref_page_prepare+0x27a/0x460
   free_unref_page+0x38/0x120
   esp_ssg_unref.isra.0+0x15f/0x200
   esp_output_tail+0x66d/0x780
   esp_xmit+0x2c5/0x360
   validate_xmit_xfrm+0x313/0x370
   ? validate_xmit_skb+0x1d/0x330
   validate_xmit_skb_list+0x4c/0x70
   sch_direct_xmit+0x23e/0x350
   __dev_queue_xmit+0x337/0xba0
   ? nf_hook_slow+0x3f/0xd0
   ip_finish_output2+0x25e/0x580
   iptunnel_xmit+0x19b/0x240
   ip_tunnel_xmit+0x5fb/0xb60
   ipgre_xmit+0x14d/0x280 [ip_gre]
   dev_hard_start_xmit+0xc3/0x1c0
   __dev_queue_xmit+0x208/0xba0
   ? nf_hook_slow+0x3f/0xd0
   ip_finish_output2+0x1ca/0x580
   ip_sublist_rcv_finish+0x32/0x40
   ip_sublist_rcv+0x1b2/0x1f0
   ? ip_rcv_finish_core.constprop.0+0x460/0x460
   ip_list_rcv+0x103/0x130
   __netif_receive_skb_list_core+0x181/0x1e0
   netif_receive_skb_list_internal+0x1b3/0x2c0
   napi_gro_receive+0xc8/0x200
   gro_cell_poll+0x52/0x90
   __napi_poll+0x25/0x1a0
   net_rx_action+0x28e/0x300
   __do_softirq+0xc3/0x276
   ? sort_range+0x20/0x20
   run_ksoftirqd+0x1e/0x30
   smpboot_thread_fn+0xa6/0x130
   kthread+0xcd/0x100
   ? kthread_complete_and_exit+0x20/0x20
   ret_from_fork+0x31/0x50
   ? kthread_complete_and_exit+0x20/0x20
   ret_from_fork_asm+0x11/0x20
   </TASK>

The suggested fix is to introduce a new wrapper (skb_page_unref) that
covers page refcounting for page_pool pages as well.
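
As a rough illustration of why a pool-aware wrapper helps, here is a toy Python model (not kernel code). The `Page` fields, `page_unref`, and `is_pool_leak` are hypothetical names invented for this sketch; only `put_page` and the idea of a wrapper that covers page_pool pages come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Page:
    refcount: int = 1
    pool_owned: bool = False        # hypothetical: page was allocated from a page_pool
    returned_to_pool: bool = False  # set when the pool gets the page back
    freed_to_system: bool = False   # set when the page goes back to the system allocator


def put_page(page: Page) -> None:
    """Plain release: drop one reference and free at zero, ignoring pool ownership."""
    page.refcount -= 1
    if page.refcount == 0:
        page.freed_to_system = True


def page_unref(page: Page, pp_recycle: bool) -> None:
    """Toy wrapper: hand pool pages back to their pool, else fall back to put_page()."""
    if pp_recycle and page.pool_owned:
        page.refcount -= 1
        if page.refcount == 0:
            page.returned_to_pool = True
        return
    put_page(page)


def is_pool_leak(page: Page) -> bool:
    """A pool page that reached the system allocator models the reported leak."""
    return page.pool_owned and page.freed_to_system


leaky = Page(pool_owned=True)
put_page(leaky)                      # old path: pool page released with put_page
fixed = Page(pool_owned=True)
page_unref(fixed, pp_recycle=True)   # wrapper path: page goes back to its pool
print(is_pool_leak(leaky), is_pool_leak(fixed))  # prints: True False
```

Pages that do not belong to a pool still take the plain `put_page` path in this model, so the wrapper changes behavior only for pool-owned pages.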

Cc: stable@vger.kernel.org
Fixes: 6a5bcd8 ("page_pool: Allow drivers to hint on SKB recycling")
Reported-and-tested-by: Anatoli N.Chechelnickiy <Anatoli.Chechelnickiy@m.interpipe.biz>
Reported-by: Ian Kumlien <ian.kumlien@gmail.com>
Link: https://lore.kernel.org/netdev/CAA85sZvvHtrpTQRqdaOx6gd55zPAVsqMYk_Lwh4Md5knTq7AyA@mail.gmail.com
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
ojeda pushed a commit that referenced this issue May 27, 2024
…uddy pages

While running memory failure tests recently, I hit the panic below:

page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page))
------------[ cut here ]------------
kernel BUG at include/linux/page-flags.h:1009!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:__del_page_from_free_list+0x151/0x180
RSP: 0018:ffffa49c90437998 EFLAGS: 00000046
RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0
RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69
R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80
R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009
FS:  00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __rmqueue_pcplist+0x23b/0x520
 get_page_from_freelist+0x26b/0xe40
 __alloc_pages_noprof+0x113/0x1120
 __folio_alloc_noprof+0x11/0xb0
 alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130
 __alloc_fresh_hugetlb_folio+0xe7/0x140
 alloc_pool_huge_folio+0x68/0x100
 set_max_huge_pages+0x13d/0x340
 hugetlb_sysctl_handler_common+0xe8/0x110
 proc_sys_call_handler+0x194/0x280
 vfs_write+0x387/0x550
 ksys_write+0x64/0xe0
 do_syscall_64+0xc2/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff916114887
RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887
RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003
RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0
R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004
R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00
 </TASK>
Modules linked in: mce_inject hwpoison_inject
---[ end trace 0000000000000000 ]---

Before the panic, there was a warning about a bad page state:

BUG: Bad page state in process page-types  pfn:8cee00
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
page_type: 0xffffff7f(buddy)
raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000
raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000
page dumped because: nonzero mapcount
Modules linked in: mce_inject hwpoison_inject
CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22
Call Trace:
 <TASK>
 dump_stack_lvl+0x83/0xa0
 bad_page+0x63/0xf0
 free_unref_page+0x36e/0x5c0
 unpoison_memory+0x50b/0x630
 simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110
 debugfs_attr_write+0x42/0x60
 full_proxy_write+0x5b/0x80
 vfs_write+0xcd/0x550
 ksys_write+0x64/0xe0
 do_syscall_64+0xc2/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f189a514887
RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887
RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003
RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8
R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040
 </TASK>

The root cause appears to be the following race:

 memory_failure
  try_memory_failure_hugetlb
   me_huge_page
    __page_handle_poison
     dissolve_free_hugetlb_folio
     drain_all_pages -- Buddy page can be isolated e.g. for compaction.
     take_page_off_buddy -- Failed as page is not in the buddy list.
             -- Page can be put back into buddy after compaction.
    page_ref_inc -- Leads to buddy page with refcnt = 1.

Then unpoison_memory() can unpoison the page and send the buddy page back
into the buddy list again, leading to the bad page state warning above.
bad_page() then calls page_mapcount_reset(), which removes PageBuddy from
the buddy page and leads to the later VM_BUG_ON_PAGE(!PageBuddy(page)) when
this page is allocated.

Fix this issue by only treating __page_handle_poison() as successful when
it returns 1.
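
A minimal Python sketch of the return-value check, under stated assumptions: the three-way convention (negative on dissolve failure, 0 when the page could not be taken off the buddy list, 1 on success) and the looser `>= 0` check are assumptions made for illustration, and every function name below is hypothetical. Only "treat it as successful when it returns 1" comes from the commit message.

```python
def page_handle_poison_model(dissolve_ok: bool, taken_off_buddy: bool) -> int:
    """Assumed convention: <0 dissolve failed, 0 not taken off buddy, 1 success."""
    if not dissolve_ok:
        return -16  # stand-in for an errno such as -EBUSY
    return 1 if taken_off_buddy else 0


def handled_loose(ret: int) -> bool:
    """Assumed shape of a check that also accepts 0."""
    return ret >= 0


def handled_strict(ret: int) -> bool:
    """Check described by the fix: success only when the return value is 1."""
    return ret == 1


def refcount_after(ret: int, handled) -> int:
    """The caller takes an extra reference only when the page counts as handled."""
    return 1 if handled(ret) else 0


# Race outcome from the trace above: dissolve worked, take_page_off_buddy failed.
raced = page_handle_poison_model(dissolve_ok=True, taken_off_buddy=False)
print(refcount_after(raced, handled_loose))   # prints: 1 (buddy page with refcnt = 1)
print(refcount_after(raced, handled_strict))  # prints: 0
```

In this model the strict check leaves the raced page's refcount untouched, so the "buddy page with refcnt = 1" state from the race diagram never arises.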

Link: https://lkml.kernel.org/r/20240523071217.1696196-1-linmiaohe@huawei.com
Fixes: ceaf8fb ("mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Darksonn pushed a commit to Darksonn/linux that referenced this issue Oct 15, 2024
When a binder reference is cleaned up, any freeze work queued in the
associated process should also be removed. Otherwise, the reference is
freed while its ref->freeze.work is still queued in proc->work leading
to a use-after-free issue as shown by the following KASAN report:

  ==================================================================
  BUG: KASAN: slab-use-after-free in binder_release_work+0x398/0x3d0
  Read of size 8 at addr ffff31600ee91488 by task kworker/5:1/211

  CPU: 5 UID: 0 PID: 211 Comm: kworker/5:1 Not tainted 6.11.0-rc7-00382-gfc6c92196396 #22
  Hardware name: linux,dummy-virt (DT)
  Workqueue: events binder_deferred_func
  Call trace:
   binder_release_work+0x398/0x3d0
   binder_deferred_func+0xb60/0x109c
   process_one_work+0x51c/0xbd4
   worker_thread+0x608/0xee8

  Allocated by task 703:
   __kmalloc_cache_noprof+0x130/0x280
   binder_thread_write+0xdb4/0x42a0
   binder_ioctl+0x18f0/0x25ac
   __arm64_sys_ioctl+0x124/0x190
   invoke_syscall+0x6c/0x254

  Freed by task 211:
   kfree+0xc4/0x230
   binder_deferred_func+0xae8/0x109c
   process_one_work+0x51c/0xbd4
   worker_thread+0x608/0xee8
  ==================================================================

This commit fixes the issue by ensuring any queued freeze work is removed
when cleaning up a binder reference.
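
The ordering problem can be sketched with a toy Python model (not binder code). `Ref`, `FreezeWork`, `cleanup_ref`, and `release_work` are hypothetical stand-ins; the model only assumes what the message states, namely that a reference's freeze work may still sit in the process work queue when the reference is freed.

```python
class FreezeWork:
    def __init__(self, owner):
        self.owner = owner  # the reference this work item is embedded in


class Ref:
    def __init__(self):
        self.freed = False
        self.freeze_work = FreezeWork(self)


def cleanup_ref(ref, proc_work, dequeue_freeze_work):
    """Free a reference, optionally removing its queued freeze work first."""
    if dequeue_freeze_work and ref.freeze_work in proc_work:
        proc_work.remove(ref.freeze_work)
    ref.freed = True


def release_work(proc_work):
    """Drain the queue and count items whose owner was already freed."""
    stale = 0
    while proc_work:
        work = proc_work.pop(0)
        if work.owner.freed:
            stale += 1  # in the kernel this would be the use-after-free read
    return stale


def run(dequeue_freeze_work):
    ref = Ref()
    proc_work = [ref.freeze_work]  # freeze work queued on the process
    cleanup_ref(ref, proc_work, dequeue_freeze_work)
    return release_work(proc_work)


print(run(dequeue_freeze_work=False))  # prints: 1
print(run(dequeue_freeze_work=True))   # prints: 0
```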

Fixes: d579b04 ("binder: frozen notification")
Cc: stable@vger.kernel.org
Acked-by: Todd Kjos <tkjos@android.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Link: https://lore.kernel.org/r/20240926233632.821189-4-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ojeda pushed a commit that referenced this issue Dec 16, 2024
This updates iso_sock_accept to use nested locking for the parent
socket, to avoid lockdep warnings caused because the parent and
child sockets are locked by the same thread:

[   41.585683] ============================================
[   41.585688] WARNING: possible recursive locking detected
[   41.585694] 6.12.0-rc6+ #22 Not tainted
[   41.585701] --------------------------------------------
[   41.585705] iso-tester/3139 is trying to acquire lock:
[   41.585711] ffff988b29530a58 (sk_lock-AF_BLUETOOTH)
               at: bt_accept_dequeue+0xe3/0x280 [bluetooth]
[   41.585905]
               but task is already holding lock:
[   41.585909] ffff988b29533a58 (sk_lock-AF_BLUETOOTH)
               at: iso_sock_accept+0x61/0x2d0 [bluetooth]
[   41.586064]
               other info that might help us debug this:
[   41.586069]  Possible unsafe locking scenario:

[   41.586072]        CPU0
[   41.586076]        ----
[   41.586079]   lock(sk_lock-AF_BLUETOOTH);
[   41.586086]   lock(sk_lock-AF_BLUETOOTH);
[   41.586093]
                *** DEADLOCK ***

[   41.586097]  May be due to missing lock nesting notation

[   41.586101] 1 lock held by iso-tester/3139:
[   41.586107]  #0: ffff988b29533a58 (sk_lock-AF_BLUETOOTH)
                at: iso_sock_accept+0x61/0x2d0 [bluetooth]
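
To show why a nesting annotation silences this report, here is a simplified Python model of a same-class check. It is a sketch, not how lockdep is implemented: `LockdepModel` and `accept` are hypothetical, and treating each (class, subclass) pair as a distinct key is a simplifying assumption.

```python
class LockdepModel:
    """Tracks held (class, subclass) pairs and records same-key re-acquisition."""

    def __init__(self):
        self.held = []
        self.warnings = []

    def acquire(self, lock_class, subclass=0):
        key = (lock_class, subclass)
        if key in self.held:
            self.warnings.append(f"possible recursive locking: {lock_class}")
        self.held.append(key)

    def release(self, lock_class, subclass=0):
        self.held.remove((lock_class, subclass))


def accept(parent_subclass):
    """Lock the parent socket, then the child socket, as one thread does in accept."""
    dep = LockdepModel()
    dep.acquire("sk_lock-AF_BLUETOOTH", parent_subclass)  # parent (listening) socket
    dep.acquire("sk_lock-AF_BLUETOOTH")                   # child socket, default subclass
    dep.release("sk_lock-AF_BLUETOOTH")
    dep.release("sk_lock-AF_BLUETOOTH", parent_subclass)
    return dep.warnings


print(len(accept(parent_subclass=0)))  # prints: 1
print(len(accept(parent_subclass=1)))  # prints: 0
```

The two sockets are different objects that share one lock class, which is why the checker needs the annotation to tell a legitimate parent/child nesting from a real self-deadlock.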

Fixes: ccf74f2 ("Bluetooth: Add BTPROTO_ISO socket type")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
ojeda pushed a commit that referenced this issue Dec 16, 2024
This fixes the circular locking dependency warning below, by
releasing the socket lock before entering iso_listen_bis, to
avoid any potential deadlock with hdev lock.

[   75.307983] ======================================================
[   75.307984] WARNING: possible circular locking dependency detected
[   75.307985] 6.12.0-rc6+ #22 Not tainted
[   75.307987] ------------------------------------------------------
[   75.307987] kworker/u81:2/2623 is trying to acquire lock:
[   75.307988] ffff8fde1769da58 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO)
               at: iso_connect_cfm+0x253/0x840 [bluetooth]
[   75.308021]
               but task is already holding lock:
[   75.308022] ffff8fdd61a10078 (&hdev->lock)
               at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[   75.308053]
               which lock already depends on the new lock.

[   75.308054]
               the existing dependency chain (in reverse order) is:
[   75.308055]
               -> #1 (&hdev->lock){+.+.}-{3:3}:
[   75.308057]        __mutex_lock+0xad/0xc50
[   75.308061]        mutex_lock_nested+0x1b/0x30
[   75.308063]        iso_sock_listen+0x143/0x5c0 [bluetooth]
[   75.308085]        __sys_listen_socket+0x49/0x60
[   75.308088]        __x64_sys_listen+0x4c/0x90
[   75.308090]        x64_sys_call+0x2517/0x25f0
[   75.308092]        do_syscall_64+0x87/0x150
[   75.308095]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   75.308098]
               -> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[   75.308100]        __lock_acquire+0x155e/0x25f0
[   75.308103]        lock_acquire+0xc9/0x300
[   75.308105]        lock_sock_nested+0x32/0x90
[   75.308107]        iso_connect_cfm+0x253/0x840 [bluetooth]
[   75.308128]        hci_connect_cfm+0x6c/0x190 [bluetooth]
[   75.308155]        hci_le_per_adv_report_evt+0x27b/0x2f0 [bluetooth]
[   75.308180]        hci_le_meta_evt+0xe7/0x200 [bluetooth]
[   75.308206]        hci_event_packet+0x21f/0x5c0 [bluetooth]
[   75.308230]        hci_rx_work+0x3ae/0xb10 [bluetooth]
[   75.308254]        process_one_work+0x212/0x740
[   75.308256]        worker_thread+0x1bd/0x3a0
[   75.308258]        kthread+0xe4/0x120
[   75.308259]        ret_from_fork+0x44/0x70
[   75.308261]        ret_from_fork_asm+0x1a/0x30
[   75.308263]
               other info that might help us debug this:

[   75.308264]  Possible unsafe locking scenario:

[   75.308264]        CPU0                CPU1
[   75.308265]        ----                ----
[   75.308265]   lock(&hdev->lock);
[   75.308267]                            lock(sk_lock-
                                                AF_BLUETOOTH-BTPROTO_ISO);
[   75.308268]                            lock(&hdev->lock);
[   75.308269]   lock(sk_lock-AF_BLUETOOTH-BTPROTO_ISO);
[   75.308270]
                *** DEADLOCK ***

[   75.308271] 4 locks held by kworker/u81:2/2623:
[   75.308272]  #0: ffff8fdd66e52148 ((wq_completion)hci0#2){+.+.}-{0:0},
                at: process_one_work+0x443/0x740
[   75.308276]  #1: ffffafb488b7fe48 ((work_completion)(&hdev->rx_work)),
                at: process_one_work+0x1ce/0x740
[   75.308280]  #2: ffff8fdd61a10078 (&hdev->lock){+.+.}-{3:3}
                at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[   75.308304]  #3: ffffffffb6ba4900 (rcu_read_lock){....}-{1:2},
                at: hci_connect_cfm+0x29/0x190 [bluetooth]
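
The inversion in the report can be reproduced with a small Python model of lock ordering. This is a sketch under assumptions: `order_edges` and `has_inversion` are hypothetical helpers, the check only finds two-lock inversions rather than general cycles, and re-taking the socket lock after the hdev section is assumed for illustration.

```python
def order_edges(acquisitions):
    """Return (held, newly_taken) pairs observed in one lock/unlock sequence."""
    held, edges = [], set()
    for op, name in acquisitions:
        if op == "lock":
            edges.update((h, name) for h in held)
            held.append(name)
        else:
            held.remove(name)
    return edges


def has_inversion(edges):
    """True if two locks are ever taken in both orders (two-lock cycles only)."""
    return any((b, a) in edges for a, b in edges)


SK, HDEV = "sk_lock", "hdev->lock"

# Event path from the trace: hdev->lock is held, then the socket lock is taken.
event_path = [("lock", HDEV), ("lock", SK), ("unlock", SK), ("unlock", HDEV)]

# Listen path before the fix: hdev->lock is taken while the socket lock is held.
listen_before = [("lock", SK), ("lock", HDEV), ("unlock", HDEV), ("unlock", SK)]

# Listen path after the fix: the socket lock is dropped before hdev->lock is taken.
listen_after = [("lock", SK), ("unlock", SK),
                ("lock", HDEV), ("unlock", HDEV),
                ("lock", SK), ("unlock", SK)]

print(has_inversion(order_edges(event_path) | order_edges(listen_before)))  # prints: True
print(has_inversion(order_edges(event_path) | order_edges(listen_after)))   # prints: False
```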

Fixes: 02171da ("Bluetooth: ISO: Add hcon for listening bis sk")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
metaspace pushed a commit to metaspace/linux that referenced this issue Dec 17, 2024
commit 7e20434 upstream.

When a binder reference is cleaned up, any freeze work queued in the
associated process should also be removed. Otherwise, the reference is
freed while its ref->freeze.work is still queued in proc->work leading
to a use-after-free issue as shown by the following KASAN report:

  ==================================================================
  BUG: KASAN: slab-use-after-free in binder_release_work+0x398/0x3d0
  Read of size 8 at addr ffff31600ee91488 by task kworker/5:1/211

  CPU: 5 UID: 0 PID: 211 Comm: kworker/5:1 Not tainted 6.11.0-rc7-00382-gfc6c92196396 #22
  Hardware name: linux,dummy-virt (DT)
  Workqueue: events binder_deferred_func
  Call trace:
   binder_release_work+0x398/0x3d0
   binder_deferred_func+0xb60/0x109c
   process_one_work+0x51c/0xbd4
   worker_thread+0x608/0xee8

  Allocated by task 703:
   __kmalloc_cache_noprof+0x130/0x280
   binder_thread_write+0xdb4/0x42a0
   binder_ioctl+0x18f0/0x25ac
   __arm64_sys_ioctl+0x124/0x190
   invoke_syscall+0x6c/0x254

  Freed by task 211:
   kfree+0xc4/0x230
   binder_deferred_func+0xae8/0x109c
   process_one_work+0x51c/0xbd4
   worker_thread+0x608/0xee8
  ==================================================================

This commit fixes the issue by ensuring any queued freeze work is removed
when cleaning up a binder reference.

Fixes: d579b04 ("binder: frozen notification")
Cc: stable@vger.kernel.org
Acked-by: Todd Kjos <tkjos@android.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Link: https://lore.kernel.org/r/20240926233632.821189-4-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>