Pelt #1

Merged
merged 127 commits into from
Jan 24, 2025

Conversation

hxsyzl
Owner

@hxsyzl hxsyzl commented Jan 24, 2025

No description provided.

rgushchin and others added 30 commits January 20, 2025 11:10
Delegatable cgroup v2 control files may require special handling
(e.g. chowning), and the exact list of such files varies between
kernel versions (and is likely to be extended in the future).

To guarantee the correctness of this list and simplify life for
userspace (systemd, first of all), let's export the list
via the /sys/kernel/cgroup/delegate pseudo-file.

The format is simple: each control file name is printed on a new line.
Example:
  $ cat /sys/kernel/cgroup/delegate
  cgroup.procs
  cgroup.subtree_control

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I9d3143ecbae9d7579d2b1e6ccf381190ef5d3255
(cherry picked from commit 01ee6cfb1483fe57c9cbd8e73817dfbf9bacffd3)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
The active development of cgroup v2 sometimes leads to the creation
of interfaces which are not turned on by default (to preserve
backward compatibility). It's handy for userspace to know which
cgroup v2 features are supported without inferring it from the
kernel version. So, let's export the list of such features
using the /sys/kernel/cgroup/features pseudo-file.

The list is hardcoded and has to be extended when new functionality
is added. Each feature is printed on a new line.

Example:
  $ cat /sys/kernel/cgroup/features
  nsdelegate
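
For illustration, a minimal userspace sketch of consuming this file; the
feature name checked is the one from the example above, and the helper name
is made up for this sketch:

  #include <stdio.h>
  #include <string.h>

  /* Return 1 if the named cgroup v2 feature is advertised by the kernel. */
  static int cgroup_feature_supported(const char *feature)
  {
          char line[128];
          FILE *f = fopen("/sys/kernel/cgroup/features", "r");
          int found = 0;

          if (!f)
                  return 0;       /* file absent on older kernels */
          while (fgets(line, sizeof(line), f)) {
                  line[strcspn(line, "\n")] = '\0';
                  if (!strcmp(line, feature)) {
                          found = 1;
                          break;
                  }
          }
          fclose(f);
          return found;
  }

  int main(void)
  {
          printf("nsdelegate: %s\n",
                 cgroup_feature_supported("nsdelegate") ? "yes" : "no");
          return 0;
  }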

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I2baf0b7bcc27491568772defc43a06d0a5ed46bf
(cherry picked from commit 5f2e673405b742be64e7c3604ed4ed3ac14f35ce)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
cgroup root names and file names have maximum length limits; we should
avoid copying names longer than those limits into the destination buffers.

tj: minor update to $SUBJ.

Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: Iff4f30be79184f19d9f3ec253bbab9c4ad91f36c
(cherry picked from commit e7fd37ba12170cc414be8b639dfc2c5f7172fac2)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
…s warning

As long as cft->name is guaranteed to be NUL-terminated, using strlcpy() would
work just as well and avoid that warning, so the change below could be folded
into that commit.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I8215beea12d94fda6a7834f8be6f8e0891285d0e
(cherry picked from commit 50034ed49645463a16327cad05694e201e6b4126)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
…y() usages in cgroup

e7fd37ba1217 ("cgroup: avoid copying strings longer than the buffers")
converted possibly unsafe strncpy() usages in cgroup to strscpy().
However, although the callsites are completely fine with truncated
copied, because strscpy() is marked __must_check, it led to the
following warnings.

  kernel/cgroup/cgroup.c: In function ‘cgroup_file_name’:
  kernel/cgroup/cgroup.c:1400:10: warning: ignoring return value of ‘strscpy’, declared with attribute warn_unused_result [-Wunused-result]
     strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
	       ^

To avoid the warnings, 50034ed49645 ("cgroup: use strlcpy() instead of
strscpy() to avoid spurious warning") switched them to strlcpy().

strlcpy() is worse than strscpy() because it unconditionally runs
strlen() on the source string, and the only reason we switched to
strlcpy() here was because it was lacking __must_check, which doesn't
reflect any material difference between the two functions.  It's just
that someone added __must_check to strscpy() and not to strlcpy().

These basic string copy operations are used in a variety of ways, and
one not-so-uncommon use case is safely handling truncated copies,
where the caller naturally doesn't care about the return value.  The
__must_check doesn't match the actual use cases and forces users to
opt for inferior variants which lack __must_check by happenstance or
to spread ugly (void) casts.

Remove __must_check from strscpy() and restore strscpy() usages in
cgroup.
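
As a rough userspace model of the truncation-tolerant copy pattern discussed
above (strscpy_model() is an illustrative stand-in, not the kernel's
strscpy()):

  #include <stdio.h>
  #include <string.h>

  #define CGROUP_FILE_NAME_MAX 64

  /* Stand-in for strscpy(): copy at most size-1 bytes, always NUL-terminate,
   * and return -1 (the kernel returns -E2BIG) when the source was truncated. */
  static long strscpy_model(char *dst, const char *src, size_t size)
  {
          size_t len = strnlen(src, size);

          if (len == size) {
                  memcpy(dst, src, size - 1);
                  dst[size - 1] = '\0';
                  return -1;
          }
          memcpy(dst, src, len + 1);
          return (long)len;
  }

  int main(void)
  {
          char buf[CGROUP_FILE_NAME_MAX];

          /* Truncation is acceptable here, so the return value is
           * deliberately ignored -- exactly the point made about __must_check. */
          strscpy_model(buf, "cgroup.subtree_control", sizeof(buf));
          printf("%s\n", buf);
          return 0;
  }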

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
(cherry picked from commit 08a77676f9c5fc69a681ccd2cd8140e65dcb26c7)
[backport the cgroup portions that weren't applied with the earlier
patch
779128d 'string: drop __must_check from
strscpy() and restore strscpy() usages in cgroup']
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Change-Id: Iaa636d39d15c44be47fc6b6ba202ecb7ff73c5e7
Make cgroup.threads file delegatable.
The behavior of cgroup.threads should follow the behavior of cgroup.procs.

Signed-off-by: Roman Gushchin <guro@fb.com>
Discovered-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I82d23cd511122e5a75b23b26e03ccc9e43b171e5
(cherry picked from commit 4f58424da3deead2605e39a9df65f5f06107a3cb)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
The cgroup_subsys structure references a documentation file that has been
renamed after the v1/v2 split.  Since the v2 documentation doesn't
currently contain any information on kernel interfaces for controllers,
point the user to the v1 docs.

Cc: Tejun Heo <tj@kernel.org>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I81c2866f6a192547e373279911b37d304ba22d1a
(cherry picked from commit 392536b731cfe82eea414f4b09c128ef37cd477e)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
The "cgroup." core interface files bypass the usual interface removal
path and get removed recursively along with the cgroup itself.  While
this works now, the subtle discrepancy gets in the way of implementing
common mechanisms.

This patch updates cgroup core interface file handling so that it's
consistent with controller interface files.  When added, the css is
marked CSS_VISIBLE and they're explicitly removed before the cgroup is
destroyed.

This doesn't cause user-visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I4091581388cb1514171d6de8fdac5f0fe6ae1695
(cherry picked from commit 5faaf05f2976fd9ec0ecd582bcfb3a41cde4c65e)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Simplify the cgroup_ancestor() function. This is a follow-up to
commit 7723628101aa ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper").

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I9e96704713f34fbc68e92b9f91c01b593708220f
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit 808c43b7c7f70360ed7b9e43e2cf980f388e71fa)
This cherry-pick differs from the original in that cgroup_ancestor is added
in place of being just modified. The patch that originally introduced the
function was 7723628101aae (bpf: Introduce bpf_skb_ancestor_cgroup_id
helper), which also relied on bpf dependencies not present in
android-4.14. cgroup_ancestor is independent of the bpf_skb code and
can hence be taken on its own.
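
For reference, a minimal standalone model of the walk the simplified
cgroup_ancestor() performs (the struct below is an illustrative stand-in,
not the kernel's struct cgroup):

  #include <stdio.h>
  #include <stddef.h>

  /* Illustrative stand-in: each cgroup records its depth (root = 0)
   * and a pointer to its parent. */
  struct cgroup {
          int level;
          struct cgroup *parent;
  };

  /* Walk up the hierarchy until the requested ancestor level is reached;
   * return NULL if the cgroup is not that deep. */
  static struct cgroup *cgroup_ancestor(struct cgroup *cgrp, int ancestor_level)
  {
          if (!cgrp || ancestor_level > cgrp->level)
                  return NULL;
          while (cgrp->level > ancestor_level)
                  cgrp = cgrp->parent;
          return cgrp;
  }

  int main(void)
  {
          struct cgroup root = { 0, NULL };
          struct cgroup child = { 1, &root };
          struct cgroup grandchild = { 2, &child };

          printf("%d\n", cgroup_ancestor(&grandchild, 1)->level); /* prints 1 */
          return 0;
  }
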
WARN_ON() already contains an unlikely(), so it's not necessary to
wrap its condition in another unlikely().

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I092c0aae2a06b13d3fc9ecfbb24ab3e8d10235f6
(cherry picked from commit 4d9ebbe2b061a9c25e12ba8539ba172533132eb6)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
…param

It can be useful to inhibit all cgroup1 hierarchies, especially during
transition and for debugging.  cgroup_no_v1 can block hierarchies with
controllers, which leaves out the named hierarchies.  Expand it to
cover the named hierarchies so that "cgroup_no_v1=all,named" disables
all cgroup1 hierarchies.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Marcin Pawlowski <mpawlowski@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: Ibd093dd9b70d15402a21db3c1ef56005ebc7f99e
(cherry picked from commit 3fc9c12d27b4ded4f1f761a800558dab2e6bbac5)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
* make the reference from superblock to cgroup_root counting -
do cgroup_put() in cgroup_kill_sb() whether we'd done
percpu_ref_kill() or not; matching grab is done when we allocate
a new root.  That gives the same refcounting rules for all callers
of cgroup_do_mount() - a reference to cgroup_root has been grabbed
by caller and it either is transferred to new superblock or dropped.

* have cgroup_kill_sb() treat an already killed refcount as "just
don't bother killing it, then".

* after successful cgroup_do_mount() have cgroup1_mount() recheck
if we'd raced with mount/umount from somebody else and cgroup_root
got killed.  In that case we drop the superblock and bugger off
with -ERESTARTSYS, same as if we'd found it in the list already
dying.

* don't bother with delayed initialization of refcount - it's
unreliable and not needed.  No need to prevent attempts to bump
the refcount if we find cgroup_root of another mount in progress -
sget will reuse an existing superblock just fine and if the
other sb manages to die before we get there, we'll catch
that immediately after cgroup_do_mount().

* don't bother with kernfs_pin_sb() - no need for doing that
either.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change-Id: I8e088dfc516b76c42d9d4b34db7f49f0cebc5414
(cherry picked from commit 35ac1184244f1329783e1d897f74926d8bb1103a)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
The callers of cgroup_migrate_prepare_dst() correctly call
cgroup_migrate_finish() for success and failure cases both. No need to
call it in cgroup_migrate_prepare_dst() in failure case.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I785d7ab70a42b1b79aea9852bb14ba5abefcaa9b
(cherry picked from commit d6e486ee0ef2f99a4069d9186e53dac61b28cb3c)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Freezer.c will contain an implementation of cgroup v2 freezer,
so let's rename the v1 freezer to avoid naming conflicts.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Change-Id: Ie196fbcca1e0bf46af9200752d8fdf90b97e5a8b
(cherry picked from commit 50943f3e136adfc421f9768d6ae09ba7b83aaefd)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
The helper is identical to the existing cgroup_task_count()
except it doesn't take the css_set_lock by itself, assuming
that the caller does.

Also, move cgroup_task_count() implementation into
kernel/cgroup/cgroup.c, as there is nothing specific to cgroup v1.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Change-Id: Iaa9085d2375d395a051543d2555389213c2892d6
(cherry picked from commit aade7f9efba098859681f8e88d81a5b44ad09b12)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.

This patch implements freezer for cgroup v2.

The cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen task to another cgroup.

This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
mostly tried to imitate the system-wide freezer. While uninterruptible
sleep is fine when all tasks are going to be frozen (the hibernation case),
it's not an acceptable state when only a subset of the system is frozen.

The cgroup v2 freezer does not support freezing kthreads.
If a non-root cgroup contains a kthread, the cgroup can still be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.

* PTRACE_ATTACH does not work because non-fatal signal delivery
is blocked in the frozen state.

There are some interface differences between the cgroup v1 and cgroup v2
freezers too, which are required to conform to the cgroup v2 interface
design principles:
1) There is no separate controller that has to be turned on:
the functionality is always available and is represented by the
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using the cgroup.events control file ("frozen" field; see the usage
sketch below). There are no dedicated transitional states.
4) It's allowed to make any changes to the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
regardless of whether some cgroups are frozen.
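
A minimal userspace sketch of driving this interface (the cgroup path is
illustrative, the caller needs write access to it, and kernfs delivers the
cgroup.events change notification as POLLPRI/POLLERR):

  #include <fcntl.h>
  #include <poll.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          int freeze = open("/sys/fs/cgroup/test/cgroup.freeze", O_WRONLY);
          int events = open("/sys/fs/cgroup/test/cgroup.events", O_RDONLY);
          char buf[256];

          if (freeze < 0 || events < 0) {
                  perror("open");
                  return 1;
          }

          /* Request the frozen state; the write returns immediately. */
          if (write(freeze, "1", 1) != 1) {
                  perror("write cgroup.freeze");
                  return 1;
          }

          /* The actual state shows up asynchronously in cgroup.events. */
          for (;;) {
                  struct pollfd pfd = { .fd = events, .events = POLLPRI };
                  ssize_t n;

                  lseek(events, 0, SEEK_SET);
                  n = read(events, buf, sizeof(buf) - 1);
                  if (n <= 0)
                          break;
                  buf[n] = '\0';
                  if (strstr(buf, "frozen 1")) {
                          printf("cgroup is frozen\n");
                          break;
                  }
                  poll(&pfd, 1, 1000);    /* wait for a state change */
          }
          return 0;
  }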

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
Change-Id: I3404119678cbcd7410aa56e9334055cee79d02fa
(cherry picked from commit 76f969e8948d82e78e1bc4beb6b9465908e74873)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
…op()

Alex Xu reported a regression in strace, caused by the introduction of
the cgroup v2 freezer. The regression can be reproduced by stracing
the following simple program:

  #include <unistd.h>

  int main() {
      write(1, "a", 1);
      return 0;
  }

An attempt to run strace ./a.out leads to the infinite loop:
  [ pre-main omitted ]
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  [ repeats forever ]

The problem occurs because the traced task leaves ptrace_stop()
(and the signal handling loop) with the frozen bit set. So let's
call cgroup_leave_frozen(true) unconditionally after sleeping
in ptrace_stop().

With this patch applied, strace works as expected:
  [ pre-main omitted ]
  write(1, "a", 1)                        = 1
  exit_group(0)                           = ?
  +++ exited with 0 +++

Reported-by: Alex Xu <alex_y_xu@yahoo.ca>
Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

Change-Id: If644b15ead36ce13f0c2c3dd57eebe3658e3edf7
(cherry picked from commit 05b289263772b0698589abc47771264a685cd365)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
If a new child cgroup is created in a frozen cgroup hierarchy
(one or more of its ancestor cgroups is frozen), the CGRP_FREEZE cgroup
flag should be set. Otherwise, if a process is attached to the
child cgroup, it won't become frozen.

The problem can be reproduced with the test_cgfreezer_mkdir test.

This is the output before this patch:
  ~/test_freezer
  ok 1 test_cgfreezer_simple
  ok 2 test_cgfreezer_tree
  ok 3 test_cgfreezer_forkbomb
  Cgroup /sys/fs/cgroup/cg_test_mkdir_A/cg_test_mkdir_B isn't frozen
  not ok 4 test_cgfreezer_mkdir
  ok 5 test_cgfreezer_rmdir
  ok 6 test_cgfreezer_migrate
  ok 7 test_cgfreezer_ptrace
  ok 8 test_cgfreezer_stopped
  ok 9 test_cgfreezer_ptraced
  ok 10 test_cgfreezer_vfork

And with this patch:
  ~/test_freezer
  ok 1 test_cgfreezer_simple
  ok 2 test_cgfreezer_tree
  ok 3 test_cgfreezer_forkbomb
  ok 4 test_cgfreezer_mkdir
  ok 5 test_cgfreezer_rmdir
  ok 6 test_cgfreezer_migrate
  ok 7 test_cgfreezer_ptrace
  ok 8 test_cgfreezer_stopped
  ok 9 test_cgfreezer_ptraced
  ok 10 test_cgfreezer_vfork

Reported-by: Mark Crossen <mcrossen@fb.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
Cc: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Tejun Heo <tj@kernel.org>

Change-Id: I6ba7b8dec5600e78bb7448f03fd97a9b43838fa0
(cherry picked from commit 97a61369830ab085df5aed0ff9256f35b07d425a)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
… disabled in ptrace_stop()

ptrace_stop() does preempt_enable_no_resched() to avoid the preemption,
but after that cgroup_enter_frozen() does spin_lock/unlock and this adds
another preemption point.

Reported-and-tested-by: Bruce Ashfield <bruce.ashfield@gmail.com>
Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Change-Id: Ic53e0f2d6624b0bb90817b0c57060fb7db971348
(cherry picked from commit 937c6b27c73e02cd4114f95f5c37ba2c29fadba1)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
The 'cgrp' variable is set but not used in commit <76f969e8948d8>
("cgroup: cgroup v2 freezer").
Remove it to avoid a [-Wunused-but-set-variable] warning.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from 533307dc20a9e84a0687d4ca24aeb669516c0243)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Change-Id: I6221a975c04f06249a4f8d693852776ae08a8d8e
The idea is from YaroST12 @ GitHub.

Signed-off-by: Julian Liu <wlootlxt123@gmail.com>
Change-Id: Ia85ed9ebdc2fcbeecef8126aac523d6d8cffdaa3
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
…g polled

commit a06247c6804f1a7c86a2e5398a4c1f1db1471848 upstream.

With a write operation on psi files replacing an old trigger with a new one,
the lifetime of its waitqueue is totally arbitrary. Overwriting an
existing trigger causes its waitqueue to be freed and a pending poll()
will stumble on trigger->event_wait, which was destroyed.
Fix this by disallowing redefinition of an existing psi trigger. If a write
operation is used on a file descriptor with an already existing psi
trigger, the operation will fail with an EBUSY error.
Also bypass the check for psi_disabled in psi_trigger_destroy(), as the
flag can be flipped after the trigger is created, leading to a memory
leak.
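
For context, a minimal sketch of how a psi trigger is registered and polled
(the threshold and window values are arbitrary; with this fix, a second
write to the same file descriptor fails with EBUSY instead of replacing the
trigger):

  #include <fcntl.h>
  #include <poll.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          /* "some 150000 1000000": notify when tasks are stalled on memory
           * for a total of 150ms or more within any 1s window. */
          const char trig[] = "some 150000 1000000";
          int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);

          if (fd < 0) {
                  perror("open /proc/pressure/memory");
                  return 1;
          }
          if (write(fd, trig, strlen(trig) + 1) < 0) {
                  perror("register trigger");
                  return 1;
          }

          struct pollfd pfd = { .fd = fd, .events = POLLPRI };

          if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLPRI))
                  printf("memory pressure threshold crossed\n");

          close(fd);
          return 0;
  }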

Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Analyzed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220111232309.1786347-1-surenb@google.com
[surenb: backported to 5.10 kernel]
CC: stable@vger.kernel.org # 5.10
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Conflicts:
        include/linux/psi.h
        kernel/cgroup/cgroup.c
        kernel/sched/psi.c

1. Resolved trivial merge conflicts.

Bug: 233410456
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I7143fef51b874c2df8d792808b6a9b666eec2c7b
The backports of upstream commit 0d2b5955b36250a9428c832664f2079cbf723bec
to the 4.14 and 4.19 stable kernels drop the changes to the
cgroup_pressure_*() functions, which breaks the PSI cgroup file handlers
in Android Common Kernels.
The partial backport changes cgroup_file_open() to allocate struct
cgroup_file_ctx and use kernfs_open_file.priv to point to it while
skipping the accompanying changes in cgroup_pressure_*(). This
leads to the cgroup_pressure_*() functions treating kernfs_open_file.priv
as a pointer to struct psi_trigger instead of struct cgroup_file_ctx.
This partial backport works fine in upstream stable kernels because
the PSI feature is missing there; however, in Android, PSI is
backported to the 4.14 and 4.19 kernels, so the missing pieces
leave the PSI feature broken.
Fix this by adding the dropped pieces from the original patch.

Link to the original patch: https://lore.kernel.org/r/20211213191833.916632-3-tj@kernel.org/
Link to the 4.14 backport: https://lore.kernel.org/r/20220418121219.794680750@linuxfoundation.org/

Bug: 233410456
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ib8858fa85a7a1fb82904cea0c8c903fd900d6316
hxsyzl pushed a commit that referenced this pull request Feb 4, 2025
We don't need to hold the local pinctrl lock here to set irq wake on the
summary irq line. Doing so only leads to lockdep warnings instead of
protecting us from anything. Remove the locking.

 WARNING: possible circular locking dependency detected
 5.4.11 #2 Tainted: G        W
 ------------------------------------------------------
 cat/3083 is trying to acquire lock:
 ffffff81f4fa58c0 (&irq_desc_lock_class){-.-.}, at: __irq_get_desc_lock+0x64/0x94

 but task is already holding lock:
 ffffff81f4880c18 (&pctrl->lock){-.-.}, at: msm_gpio_irq_set_wake+0x48/0x7c

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&pctrl->lock){-.-.}:
        _raw_spin_lock_irqsave+0x64/0x80
        msm_gpio_irq_ack+0x68/0xf4
        __irq_do_set_handler+0xe0/0x180
        __irq_set_handler+0x60/0x9c
        irq_domain_set_info+0x90/0xb4
        gpiochip_hierarchy_irq_domain_alloc+0x110/0x200
        __irq_domain_alloc_irqs+0x130/0x29c
        irq_create_fwspec_mapping+0x1f0/0x300
        irq_create_of_mapping+0x70/0x98
        of_irq_get+0xa4/0xd4
        spi_drv_probe+0x4c/0xb0
        really_probe+0x138/0x3f0
        driver_probe_device+0x70/0x140
        __device_attach_driver+0x9c/0x110
        bus_for_each_drv+0x88/0xd0
        __device_attach+0xb0/0x160
        device_initial_probe+0x20/0x2c
        bus_probe_device+0x34/0x94
        device_add+0x35c/0x3f0
        spi_add_device+0xbc/0x194
        of_register_spi_devices+0x2c8/0x408
        spi_register_controller+0x57c/0x6fc
        spi_geni_probe+0x260/0x328
        platform_drv_probe+0x90/0xb0
        really_probe+0x138/0x3f0
        driver_probe_device+0x70/0x140
        device_driver_attach+0x4c/0x6c
        __driver_attach+0xcc/0x154
        bus_for_each_dev+0x84/0xcc
        driver_attach+0x2c/0x38
        bus_add_driver+0x108/0x1fc
        driver_register+0x64/0xf8
        __platform_driver_register+0x4c/0x58
        spi_geni_driver_init+0x1c/0x24
        do_one_initcall+0x1a4/0x3e8
        do_initcall_level+0xb4/0xcc
        do_basic_setup+0x30/0x48
        kernel_init_freeable+0x124/0x1a8
        kernel_init+0x14/0x100
        ret_from_fork+0x10/0x18

 -> #0 (&irq_desc_lock_class){-.-.}:
        __lock_acquire+0xeb4/0x2388
        lock_acquire+0x1cc/0x210
        _raw_spin_lock_irqsave+0x64/0x80
        __irq_get_desc_lock+0x64/0x94
        irq_set_irq_wake+0x40/0x144
        msm_gpio_irq_set_wake+0x5c/0x7c
        set_irq_wake_real+0x40/0x5c
        irq_set_irq_wake+0x70/0x144
        cros_ec_rtc_suspend+0x38/0x4c
        platform_pm_suspend+0x34/0x60
        dpm_run_callback+0x64/0xcc
        __device_suspend+0x310/0x41c
        dpm_suspend+0xf8/0x298
        dpm_suspend_start+0x84/0xb4
        suspend_devices_and_enter+0xbc/0x620
        pm_suspend+0x210/0x348
        state_store+0xb0/0x108
        kobj_attr_store+0x14/0x24
        sysfs_kf_write+0x4c/0x64
        kernfs_fop_write+0x15c/0x1fc
        __vfs_write+0x54/0x18c
        vfs_write+0xe4/0x1a4
        ksys_write+0x7c/0xe4
        __arm64_sys_write+0x20/0x2c
        el0_svc_common+0xa8/0x160
        el0_svc_handler+0x7c/0x98
        el0_svc+0x8/0xc

 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&pctrl->lock);
                                lock(&irq_desc_lock_class);
                                lock(&pctrl->lock);
   lock(&irq_desc_lock_class);

  *** DEADLOCK ***

 7 locks held by cat/3083:
  #0: ffffff81f06d1420 (sb_writers#7){.+.+}, at: vfs_write+0xd0/0x1a4
  #1: ffffff81c8935680 (&of->mutex){+.+.}, at: kernfs_fop_write+0x12c/0x1fc
  #2: ffffff81f4c322f0 (kn->count#337){.+.+}, at: kernfs_fop_write+0x134/0x1fc
  #3: ffffffe89a641d60 (system_transition_mutex){+.+.}, at: pm_suspend+0x108/0x348
  #4: ffffff81f190e970 (&dev->mutex){....}, at: __device_suspend+0x168/0x41c
  #5: ffffff81f183d8c0 (lock_class){-.-.}, at: __irq_get_desc_lock+0x64/0x94
  #6: ffffff81f4880c18 (&pctrl->lock){-.-.}, at: msm_gpio_irq_set_wake+0x48/0x7c

 stack backtrace:
 CPU: 4 PID: 3083 Comm: cat Tainted: G        W         5.4.11 #2
 Hardware name: Google Cheza (rev3+) (DT)
 Call trace:
  dump_backtrace+0x0/0x174
  show_stack+0x20/0x2c
  dump_stack+0xc8/0x124
  print_circular_bug+0x2ac/0x2c4
  check_noncircular+0x1a0/0x1a8
  __lock_acquire+0xeb4/0x2388
  lock_acquire+0x1cc/0x210
  _raw_spin_lock_irqsave+0x64/0x80
  __irq_get_desc_lock+0x64/0x94
  irq_set_irq_wake+0x40/0x144
  msm_gpio_irq_set_wake+0x5c/0x7c
  set_irq_wake_real+0x40/0x5c
  irq_set_irq_wake+0x70/0x144
  cros_ec_rtc_suspend+0x38/0x4c
  platform_pm_suspend+0x34/0x60
  dpm_run_callback+0x64/0xcc
  __device_suspend+0x310/0x41c
  dpm_suspend+0xf8/0x298
  dpm_suspend_start+0x84/0xb4
  suspend_devices_and_enter+0xbc/0x620
  pm_suspend+0x210/0x348
  state_store+0xb0/0x108
  kobj_attr_store+0x14/0x24
  sysfs_kf_write+0x4c/0x64
  kernfs_fop_write+0x15c/0x1fc
  __vfs_write+0x54/0x18c
  vfs_write+0xe4/0x1a4
  ksys_write+0x7c/0xe4
  __arm64_sys_write+0x20/0x2c
  el0_svc_common+0xa8/0x160
  el0_svc_handler+0x7c/0x98
  el0_svc+0x8/0xc

Fixes: 6aced33 ("pinctrl: msm: drop wake_irqs bitmap")
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Brian Masney <masneyb@onstation.org>
Cc: Lina Iyer <ilina@codeaurora.org>
Cc: Maulik Shah <mkshah@codeaurora.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/20200121180950.36959-1-swboyd@chromium.org
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Ahmad Thoriq Najahi <najahi@chips-projects.xyz>
hxsyzl pushed a commit that referenced this pull request Feb 4, 2025
When running rcutorture with TREE03 config, CONFIG_PROVE_LOCKING=y, and
kernel cmdline argument "rcutorture.gp_exp=1", lockdep reports a
HARDIRQ-safe->HARDIRQ-unsafe deadlock:

 ================================
 WARNING: inconsistent lock state
 4.16.0-rc4+ #1 Not tainted
 --------------------------------
 inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
 takes:
 __schedule+0xbe/0xaf0
 {IN-HARDIRQ-W} state was registered at:
   _raw_spin_lock+0x2a/0x40
   scheduler_tick+0x47/0xf0
...
 other info that might help us debug this:
  Possible unsafe locking scenario:
        CPU0
        ----
   lock(&rq->lock);
   <Interrupt>
     lock(&rq->lock);
  *** DEADLOCK ***
 1 lock held by rcu_torture_rea/724:
 rcu_torture_read_lock+0x0/0x70
 stack backtrace:
 CPU: 2 PID: 724 Comm: rcu_torture_rea Not tainted 4.16.0-rc4+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
 Call Trace:
  lock_acquire+0x90/0x200
  ? __schedule+0xbe/0xaf0
  _raw_spin_lock+0x2a/0x40
  ? __schedule+0xbe/0xaf0
  __schedule+0xbe/0xaf0
  preempt_schedule_irq+0x2f/0x60
  retint_kernel+0x1b/0x2d
 RIP: 0010:rcu_read_unlock_special+0x0/0x680
  ? rcu_torture_read_unlock+0x60/0x60
  __rcu_read_unlock+0x64/0x70
  rcu_torture_read_unlock+0x17/0x60
  rcu_torture_reader+0x275/0x450
  ? rcutorture_booster_init+0x110/0x110
  ? rcu_torture_stall+0x230/0x230
  ? kthread+0x10e/0x130
  kthread+0x10e/0x130
  ? kthread_create_worker_on_cpu+0x70/0x70
  ? call_usermodehelper_exec_async+0x11a/0x150
  ret_from_fork+0x3a/0x50

This happens with the following event sequence:

	preempt_schedule_irq();
	  local_irq_enable();
	  __schedule():
	    local_irq_disable(); // irq off
	    ...
	    rcu_note_context_switch():
	      rcu_note_preempt_context_switch():
	        rcu_read_unlock_special():
	          local_irq_save(flags);
	          ...
		  raw_spin_unlock_irqrestore(...,flags); // irq remains off
	          rt_mutex_futex_unlock():
	            raw_spin_lock_irq();
	            ...
	            raw_spin_unlock_irq(); // accidentally set irq on

	    <return to __schedule()>
	    rq_lock():
	      raw_spin_lock(); // acquiring rq->lock with irq on

which means rq->lock becomes a HARDIRQ-unsafe lock, which can cause
deadlocks in scheduler code.

This problem was introduced by commit 02a7c234e540 ("rcu: Suppress
lockdep false-positive ->boost_mtx complaints"). That change introduced a
caller of rt_mutex_futex_unlock() with irqs off.

To fix this, replace the *lock_irq() calls in rt_mutex_futex_unlock() with
*lock_irq{save,restore}() so that it is safe to call rt_mutex_futex_unlock()
with irqs off.

Fixes: 02a7c234e540 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints")
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20180309065630.8283-1-boqun.feng@gmail.com

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
hxsyzl pushed a commit that referenced this pull request Feb 4, 2025
Patch series "mm: zap pages with read mmap_sem in munmap for large
mapping", v11.

Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping large memory
space, please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.

History:
Then akpm suggested to unmap large mapping section by section and drop mmap_sem
at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).

The V1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786).  Then I received a lot of great
feedback and suggestions.

Then this topic was discussed at the LSFMM summit 2018.  In the summit, Michal
Hocko suggested (also in the v1 patches review) trying a "two phases"
approach: zapping pages with read mmap_sem, then doing the cleanup with
write mmap_sem (for discussion details, see
https://lwn.net/Articles/753269/).

Approach:
Zapping pages is the most time consuming part, according to the suggestion from
Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.

But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
  * The unexpected state from PF if it wins the race in the middle of munmap.
    It may return zero page, instead of the content or SIGSEGV.
  * Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
    is a showstopper from akpm

But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        deal with special mappings
        detach vmas
        downgrade_write

        zap pages
        free page tables
        release mmap_sem

The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e.  page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.

If the vma has VM_HUGETLB | VM_PFNMAP, it is considered a special
mapping.  Such mappings will be handled by falling back to regular do_munmap()
with exclusive mmap_sem held in this patch since they may update vm flags.

But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.

Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.  So, uprobe unmap will not be handled by the regular
path.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.

And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.

For the time being, just do this in munmap syscall path.  Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle.  But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization.  So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them.  mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.

This patch (of 3):

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

This is because munmap holds mmap_sem exclusively from the very beginning all
the way to the end, and doesn't release it in the middle.  When
unmapping a large mapping, it may take a long time (~18 seconds to unmap a
320GB mapping with every single page mapped, on an otherwise idle machine).

Zapping pages is the most time consuming part, according to the suggestion
from Michal Hocko [1], zapping pages can be done with holding read
mmap_sem, like what MADV_DONTNEED does.  Then re-acquire write mmap_sem to
cleanup vmas.

But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        deal with special mappings
        detach vmas
        downgrade_write

        zap pages
        free page tables
        release mmap_sem

The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e.  page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.

If the vma has VM_HUGETLB | VM_PFNMAP, it is considered a special
mapping.  Such mappings will be handled without downgrading mmap_sem in this
patch since they may update vm flags.

But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.

Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.

And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.

For the time being, just do this in munmap syscall path.  Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle.  But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization.  So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them.  mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.

With the patches, the exclusive mmap_sem hold time when munmapping an 80GB
address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped from
seconds to the microsecond level.

munmap_test-15002 [008]   594.380138: funcgraph_entry: |
__vm_munmap() {
munmap_test-15002 [008]   594.380146: funcgraph_entry:      !2485684 us
|    unmap_region();
munmap_test-15002 [008]   596.865836: funcgraph_exit:       !2485692 us
|  }

Here the execution time of unmap_region() is used to evaluate the time of
holding read mmap_sem, then the remaining time is used with holding
exclusive lock.

[1] https://lwn.net/Articles/753269/

Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Alex Winkowski <dereference23@outlook.com>
hxsyzl pushed a commit that referenced this pull request Feb 4, 2025
Aneesh reported that:

	tlb_flush_mmu()
	  tlb_flush_mmu_tlbonly()
	    tlb_flush()			<-- #1
	  tlb_flush_mmu_free()
	    tlb_table_flush()
	      tlb_table_invalidate()
		tlb_flush_mmu_tlbonly()
		  tlb_flush()		<-- #2

does two TLBIs when tlb->fullmm, because __tlb_reset_range() will not
clear tlb->end in that case.

Observe that any caller to __tlb_adjust_range() also sets at least one of
the tlb->freed_tables || tlb->cleared_p* bits, and those are
unconditionally cleared by __tlb_reset_range().

Change the condition for actually issuing TLBI to having one of those bits
set, as opposed to having tlb->end != 0.

Link: http://lkml.kernel.org/r/20200116064531.483522-4-aneesh.kumar@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Alex Winkowski <dereference23@outlook.com>
hxsyzl pushed a commit that referenced this pull request Feb 4, 2025
The syzbot reported the below general protection fault:

  general protection fault, probably for non-canonical address
  0xe00eeaee0000003b: 0000 [#1] PREEMPT SMP KASAN
  KASAN: maybe wild-memory-access in range [0x00777770000001d8-0x00777770000001df]
  CPU: 1 PID: 10488 Comm: syz-executor721 Not tainted 5.9.0-rc3-syzkaller #0
  RIP: 0010:unlink_file_vma+0x57/0xb0 mm/mmap.c:164
  Call Trace:
     free_pgtables+0x1b3/0x2f0 mm/memory.c:415
     exit_mmap+0x2c0/0x530 mm/mmap.c:3184
     __mmput+0x122/0x470 kernel/fork.c:1076
     mmput+0x53/0x60 kernel/fork.c:1097
     exit_mm kernel/exit.c:483 [inline]
     do_exit+0xa8b/0x29f0 kernel/exit.c:793
     do_group_exit+0x125/0x310 kernel/exit.c:903
     get_signal+0x428/0x1f00 kernel/signal.c:2757
     arch_do_signal+0x82/0x2520 arch/x86/kernel/signal.c:811
     exit_to_user_mode_loop kernel/entry/common.c:136 [inline]
     exit_to_user_mode_prepare+0x1ae/0x200 kernel/entry/common.c:167
     syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:242
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

It's because the ->mmap() callback can change vma->vm_file and fput the
original file.  But the commit d70cec898324 ("mm: mmap: merge vma after
call_mmap() if possible") failed to catch this case and always fput()
the original file, hence add an extra fput().

[ Thanks Hillf for pointing this extra fput() out. ]

Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible")
Reported-by: syzbot+c5d5a51dcbb558ca0cb5@syzkaller.appspotmail.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christian König <ckoenig.leichtzumerken@gmail.com>
Cc: Hongxiang Lou <louhongxiang@huawei.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Link: https://lkml.kernel.org/r/20200916090733.31427-1-linmiaohe@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Alex Winkowski <dereference23@outlook.com>
hxsyzl pushed a commit that referenced this pull request Feb 5, 2025
We don't need to hold the local pinctrl lock here to set irq wake on the
summary irq line. Doing so only leads to lockdep warnings instead of
protecting us from anything. Remove the locking.

 WARNING: possible circular locking dependency detected
 5.4.11 #2 Tainted: G        W
 ------------------------------------------------------
 cat/3083 is trying to acquire lock:
 ffffff81f4fa58c0 (&irq_desc_lock_class){-.-.}, at: __irq_get_desc_lock+0x64/0x94

 but task is already holding lock:
 ffffff81f4880c18 (&pctrl->lock){-.-.}, at: msm_gpio_irq_set_wake+0x48/0x7c

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&pctrl->lock){-.-.}:
        _raw_spin_lock_irqsave+0x64/0x80
        msm_gpio_irq_ack+0x68/0xf4
        __irq_do_set_handler+0xe0/0x180
        __irq_set_handler+0x60/0x9c
        irq_domain_set_info+0x90/0xb4
        gpiochip_hierarchy_irq_domain_alloc+0x110/0x200
        __irq_domain_alloc_irqs+0x130/0x29c
        irq_create_fwspec_mapping+0x1f0/0x300
        irq_create_of_mapping+0x70/0x98
        of_irq_get+0xa4/0xd4
        spi_drv_probe+0x4c/0xb0
        really_probe+0x138/0x3f0
        driver_probe_device+0x70/0x140
        __device_attach_driver+0x9c/0x110
        bus_for_each_drv+0x88/0xd0
        __device_attach+0xb0/0x160
        device_initial_probe+0x20/0x2c
        bus_probe_device+0x34/0x94
        device_add+0x35c/0x3f0
        spi_add_device+0xbc/0x194
        of_register_spi_devices+0x2c8/0x408
        spi_register_controller+0x57c/0x6fc
        spi_geni_probe+0x260/0x328
        platform_drv_probe+0x90/0xb0
        really_probe+0x138/0x3f0
        driver_probe_device+0x70/0x140
        device_driver_attach+0x4c/0x6c
        __driver_attach+0xcc/0x154
        bus_for_each_dev+0x84/0xcc
        driver_attach+0x2c/0x38
        bus_add_driver+0x108/0x1fc
        driver_register+0x64/0xf8
        __platform_driver_register+0x4c/0x58
        spi_geni_driver_init+0x1c/0x24
        do_one_initcall+0x1a4/0x3e8
        do_initcall_level+0xb4/0xcc
        do_basic_setup+0x30/0x48
        kernel_init_freeable+0x124/0x1a8
        kernel_init+0x14/0x100
        ret_from_fork+0x10/0x18

 -> #0 (&irq_desc_lock_class){-.-.}:
        __lock_acquire+0xeb4/0x2388
        lock_acquire+0x1cc/0x210
        _raw_spin_lock_irqsave+0x64/0x80
        __irq_get_desc_lock+0x64/0x94
        irq_set_irq_wake+0x40/0x144
        msm_gpio_irq_set_wake+0x5c/0x7c
        set_irq_wake_real+0x40/0x5c
        irq_set_irq_wake+0x70/0x144
        cros_ec_rtc_suspend+0x38/0x4c
        platform_pm_suspend+0x34/0x60
        dpm_run_callback+0x64/0xcc
        __device_suspend+0x310/0x41c
        dpm_suspend+0xf8/0x298
        dpm_suspend_start+0x84/0xb4
        suspend_devices_and_enter+0xbc/0x620
        pm_suspend+0x210/0x348
        state_store+0xb0/0x108
        kobj_attr_store+0x14/0x24
        sysfs_kf_write+0x4c/0x64
        kernfs_fop_write+0x15c/0x1fc
        __vfs_write+0x54/0x18c
        vfs_write+0xe4/0x1a4
        ksys_write+0x7c/0xe4
        __arm64_sys_write+0x20/0x2c
        el0_svc_common+0xa8/0x160
        el0_svc_handler+0x7c/0x98
        el0_svc+0x8/0xc

 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&pctrl->lock);
                                lock(&irq_desc_lock_class);
                                lock(&pctrl->lock);
   lock(&irq_desc_lock_class);

  *** DEADLOCK ***

 7 locks held by cat/3083:
  #0: ffffff81f06d1420 (sb_writers#7){.+.+}, at: vfs_write+0xd0/0x1a4
  #1: ffffff81c8935680 (&of->mutex){+.+.}, at: kernfs_fop_write+0x12c/0x1fc
  #2: ffffff81f4c322f0 (kn->count#337){.+.+}, at: kernfs_fop_write+0x134/0x1fc
  #3: ffffffe89a641d60 (system_transition_mutex){+.+.}, at: pm_suspend+0x108/0x348
  #4: ffffff81f190e970 (&dev->mutex){....}, at: __device_suspend+0x168/0x41c
  #5: ffffff81f183d8c0 (lock_class){-.-.}, at: __irq_get_desc_lock+0x64/0x94
  #6: ffffff81f4880c18 (&pctrl->lock){-.-.}, at: msm_gpio_irq_set_wake+0x48/0x7c

 stack backtrace:
 CPU: 4 PID: 3083 Comm: cat Tainted: G        W         5.4.11 #2
 Hardware name: Google Cheza (rev3+) (DT)
 Call trace:
  dump_backtrace+0x0/0x174
  show_stack+0x20/0x2c
  dump_stack+0xc8/0x124
  print_circular_bug+0x2ac/0x2c4
  check_noncircular+0x1a0/0x1a8
  __lock_acquire+0xeb4/0x2388
  lock_acquire+0x1cc/0x210
  _raw_spin_lock_irqsave+0x64/0x80
  __irq_get_desc_lock+0x64/0x94
  irq_set_irq_wake+0x40/0x144
  msm_gpio_irq_set_wake+0x5c/0x7c
  set_irq_wake_real+0x40/0x5c
  irq_set_irq_wake+0x70/0x144
  cros_ec_rtc_suspend+0x38/0x4c
  platform_pm_suspend+0x34/0x60
  dpm_run_callback+0x64/0xcc
  __device_suspend+0x310/0x41c
  dpm_suspend+0xf8/0x298
  dpm_suspend_start+0x84/0xb4
  suspend_devices_and_enter+0xbc/0x620
  pm_suspend+0x210/0x348
  state_store+0xb0/0x108
  kobj_attr_store+0x14/0x24
  sysfs_kf_write+0x4c/0x64
  kernfs_fop_write+0x15c/0x1fc
  __vfs_write+0x54/0x18c
  vfs_write+0xe4/0x1a4
  ksys_write+0x7c/0xe4
  __arm64_sys_write+0x20/0x2c
  el0_svc_common+0xa8/0x160
  el0_svc_handler+0x7c/0x98
  el0_svc+0x8/0xc

Fixes: 6aced33 ("pinctrl: msm: drop wake_irqs bitmap")
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Brian Masney <masneyb@onstation.org>
Cc: Lina Iyer <ilina@codeaurora.org>
Cc: Maulik Shah <mkshah@codeaurora.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/20200121180950.36959-1-swboyd@chromium.org
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Ahmad Thoriq Najahi <najahi@chips-projects.xyz>
hxsyzl pushed a commit that referenced this pull request Feb 5, 2025
When running rcutorture with TREE03 config, CONFIG_PROVE_LOCKING=y, and
kernel cmdline argument "rcutorture.gp_exp=1", lockdep reports a
HARDIRQ-safe->HARDIRQ-unsafe deadlock:

 ================================
 WARNING: inconsistent lock state
 4.16.0-rc4+ #1 Not tainted
 --------------------------------
 inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
 takes:
 __schedule+0xbe/0xaf0
 {IN-HARDIRQ-W} state was registered at:
   _raw_spin_lock+0x2a/0x40
   scheduler_tick+0x47/0xf0
...
 other info that might help us debug this:
  Possible unsafe locking scenario:
        CPU0
        ----
   lock(&rq->lock);
   <Interrupt>
     lock(&rq->lock);
  *** DEADLOCK ***
 1 lock held by rcu_torture_rea/724:
 rcu_torture_read_lock+0x0/0x70
 stack backtrace:
 CPU: 2 PID: 724 Comm: rcu_torture_rea Not tainted 4.16.0-rc4+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
 Call Trace:
  lock_acquire+0x90/0x200
  ? __schedule+0xbe/0xaf0
  _raw_spin_lock+0x2a/0x40
  ? __schedule+0xbe/0xaf0
  __schedule+0xbe/0xaf0
  preempt_schedule_irq+0x2f/0x60
  retint_kernel+0x1b/0x2d
 RIP: 0010:rcu_read_unlock_special+0x0/0x680
  ? rcu_torture_read_unlock+0x60/0x60
  __rcu_read_unlock+0x64/0x70
  rcu_torture_read_unlock+0x17/0x60
  rcu_torture_reader+0x275/0x450
  ? rcutorture_booster_init+0x110/0x110
  ? rcu_torture_stall+0x230/0x230
  ? kthread+0x10e/0x130
  kthread+0x10e/0x130
  ? kthread_create_worker_on_cpu+0x70/0x70
  ? call_usermodehelper_exec_async+0x11a/0x150
  ret_from_fork+0x3a/0x50

This happens with the following even sequence:

	preempt_schedule_irq();
	  local_irq_enable();
	  __schedule():
	    local_irq_disable(); // irq off
	    ...
	    rcu_note_context_switch():
	      rcu_note_preempt_context_switch():
	        rcu_read_unlock_special():
	          local_irq_save(flags);
	          ...
		  raw_spin_unlock_irqrestore(...,flags); // irq remains off
	          rt_mutex_futex_unlock():
	            raw_spin_lock_irq();
	            ...
	            raw_spin_unlock_irq(); // accidentally set irq on

	    <return to __schedule()>
	    rq_lock():
	      raw_spin_lock(); // acquiring rq->lock with irq on

which means rq->lock becomes a HARDIRQ-unsafe lock, which can cause
deadlocks in scheduler code.

This problem was introduced by commit 02a7c234e540 ("rcu: Suppress
lockdep false-positive ->boost_mtx complaints"). That brought the user
of rt_mutex_futex_unlock() with irq off.

To fix this, replace the *lock_irq() in rt_mutex_futex_unlock() with
*lock_irq{save,restore}() to make it safe to call rt_mutex_futex_unlock()
with irq off.

Fixes: 02a7c234e540 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints")
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20180309065630.8283-1-boqun.feng@gmail.com

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
hxsyzl pushed a commit that referenced this pull request Feb 5, 2025
Patch series "mm: zap pages with read mmap_sem in munmap for large
mapping", v11.

Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping large memory
space, please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.

History:
Then akpm suggested to unmap large mapping section by section and drop mmap_sem
at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).

V1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786).  Then I received a lot great
feedback and suggestions.

Then this topic was discussed on LSFMM summit 2018.  In the summit, Michal
Hocko suggested (also in the v1 patches review) to try "two phases"
approach.  Zapping pages with read mmap_sem, then doing via cleanup with
write mmap_sem (for discussion detail, see
https://lwn.net/Articles/753269/)

Approach:
Zapping pages is the most time consuming part, according to the suggestion from
Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.

But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
  * The unexpected state from PF if it wins the race in the middle of munmap.
    It may return zero page, instead of the content or SIGSEGV.
  * Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
    is a showstopper from akpm

But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        deal with special mappings
        detach vmas
        downgrade_write

        zap pages
        free page tables
        release mmap_sem

The vm events with read mmap_sem may come in during page zapping, but
since vmas have been detached before, they, i.e.  page fault, gup, etc,
will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
expected.

If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
mappings.  They will be handled by falling back to regular do_munmap()
with exclusive mmap_sem held in this patch since they may update vm flags.

But, with the "detach vmas first" approach, the vmas have been detached
when vm flags are updated, so it sounds safe to update vm flags with read
mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be
handled by using the optimized path in the following separate patches for
bisectable sake.

Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
However it is fine to have false-positive MMF_RECALC_UPROBES according to
uprobes developer.  So, uprobe unmap will not be handled by the regular
path.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.

And, since the lock acquire/release cost is managed to the minimum and
almost as same as before, the optimization could be extended to any size
of mapping without incurring significant penalty to small mappings.

For the time being, just do this in munmap syscall path.  Other
vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
intact due to some implementation difficulties since they acquire write
mmap_sem from very beginning and hold it until the end, do_munmap() might
be called in the middle.  But, the optimized do_munmap would like to be
called without mmap_sem held so that we can do the optimization.  So, if
we want to do the similar optimization for mmap/mremap path, I'm afraid we
would have to redesign them.  mremap might be called on very large area
depending on the usecases, the optimization to it will be considered in
the future.

This patch (of 3):

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

This is because munmap holds the mmap_sem exclusively from the very beginning
all the way to the end and doesn't release it in the middle.  Unmapping a
large mapping may take a long time (~18 seconds to unmap a 320GB mapping with
every single page mapped, on an idle machine).

Zapping pages is the most time-consuming part.  According to the suggestion
from Michal Hocko [1], zapping pages can be done while holding the read
mmap_sem, like what MADV_DONTNEED does.  The write mmap_sem is then
re-acquired to clean up vmas.

But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        deal with special mappings
        detach vmas
        downgrade_write

        zap pages
        free page tables
        release mmap_sem

VM events that take the read mmap_sem (i.e. page fault, gup, etc.) may come in
during page zapping, but since the vmas have already been detached, they will
not be able to find a valid vma and just return SIGSEGV or -EFAULT as expected.

Vmas with VM_HUGETLB | VM_PFNMAP are considered special mappings.  In this
patch they are handled without downgrading the mmap_sem, since they may update
vm flags.

But, with the "detach vmas first" approach, the vmas have already been detached
when vm flags are updated, so it sounds safe to update vm flags with the read
mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be handled
by the optimized path in the following separate patches, for the sake of
bisectability.

Unmapping uprobe areas may need to update mm flags (MMF_RECALC_UPROBES).
However, according to the uprobes developer, a false-positive
MMF_RECALC_UPROBES is fine.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.

And, since the lock acquire/release cost is kept to a minimum, almost the same
as before, the optimization can be extended to mappings of any size without
incurring a significant penalty for small mappings.

For the time being, just do this in the munmap syscall path.  Other
vm_munmap() or do_munmap() call sites (i.e. mmap, mremap, etc.) remain intact
due to implementation difficulties: they acquire the write mmap_sem from the
very beginning and hold it until the end, and do_munmap() might be called in
the middle.  But the optimized do_munmap() would like to be called without the
mmap_sem held so that the optimization can be applied.  So, if we want to do a
similar optimization for the mmap/mremap path, I'm afraid we would have to
redesign them.  mremap might be called on a very large area depending on the
use case; the optimization for it will be considered in the future.

With the patches, the exclusive mmap_sem hold time when munmapping an 80GB
address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped from
seconds to the microsecond level.

munmap_test-15002 [008]   594.380138: funcgraph_entry:              |  __vm_munmap() {
munmap_test-15002 [008]   594.380146: funcgraph_entry:  !2485684 us |    unmap_region();
munmap_test-15002 [008]   596.865836: funcgraph_exit:   !2485692 us |  }

Here the execution time of unmap_region() is used to evaluate the time spent
holding the read mmap_sem; the remaining time is spent holding the exclusive
lock.

[1] https://lwn.net/Articles/753269/

Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Alex Winkowski <dereference23@outlook.com>
hxsyzl pushed a commit that referenced this pull request Feb 5, 2025
Aneesh reported that:

	tlb_flush_mmu()
	  tlb_flush_mmu_tlbonly()
	    tlb_flush()			<-- #1
	  tlb_flush_mmu_free()
	    tlb_table_flush()
	      tlb_table_invalidate()
		tlb_flush_mmu_tlbonly()
		  tlb_flush()		<-- #2

does two TLBIs when tlb->fullmm, because __tlb_reset_range() will not
clear tlb->end in that case.

Observe that any caller to __tlb_adjust_range() also sets at least one of
the tlb->freed_tables || tlb->cleared_p* bits, and those are
unconditionally cleared by __tlb_reset_range().

Change the condition for actually issuing TLBI to having one of those bits
set, as opposed to having tlb->end != 0.
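
A hedged sketch of what the resulting check looks like (field names follow the
struct mmu_gather bits mentioned above; the exact code may differ):

  static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
  {
          /*
           * Only issue a TLBI when something was actually batched: every
           * caller of __tlb_adjust_range() sets freed_tables or one of the
           * cleared_p* bits, and __tlb_reset_range() clears them all, so a
           * second invocation becomes a no-op even when tlb->fullmm keeps
           * tlb->end set.
           */
          if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
                tlb->cleared_puds || tlb->cleared_p4ds))
                  return;

          tlb_flush(tlb);
          __tlb_reset_range(tlb);
  }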

Link: http://lkml.kernel.org/r/20200116064531.483522-4-aneesh.kumar@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Alex Winkowski <dereference23@outlook.com>
hxsyzl pushed a commit that referenced this pull request Feb 5, 2025
The syzbot reported the below general protection fault:

  general protection fault, probably for non-canonical address
  0xe00eeaee0000003b: 0000 [#1] PREEMPT SMP KASAN
  KASAN: maybe wild-memory-access in range [0x00777770000001d8-0x00777770000001df]
  CPU: 1 PID: 10488 Comm: syz-executor721 Not tainted 5.9.0-rc3-syzkaller #0
  RIP: 0010:unlink_file_vma+0x57/0xb0 mm/mmap.c:164
  Call Trace:
     free_pgtables+0x1b3/0x2f0 mm/memory.c:415
     exit_mmap+0x2c0/0x530 mm/mmap.c:3184
     __mmput+0x122/0x470 kernel/fork.c:1076
     mmput+0x53/0x60 kernel/fork.c:1097
     exit_mm kernel/exit.c:483 [inline]
     do_exit+0xa8b/0x29f0 kernel/exit.c:793
     do_group_exit+0x125/0x310 kernel/exit.c:903
     get_signal+0x428/0x1f00 kernel/signal.c:2757
     arch_do_signal+0x82/0x2520 arch/x86/kernel/signal.c:811
     exit_to_user_mode_loop kernel/entry/common.c:136 [inline]
     exit_to_user_mode_prepare+0x1ae/0x200 kernel/entry/common.c:167
     syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:242
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

It's because the ->mmap() callback can change vma->vm_file and fput the
original file.  But the commit d70cec898324 ("mm: mmap: merge vma after
call_mmap() if possible") failed to catch this case and always fput()
the original file, hence add an extra fput().

[ Thanks Hillf for pointing this extra fput() out. ]
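
Roughly, the merge path has to drop the reference on whatever file the vma
holds after ->mmap() has run, in addition to the caller's original 'file'
reference.  An illustrative sketch (try_merge_with_prev() is a hypothetical
stand-in for the vma_merge() call, not the exact mm/mmap.c hunk):

  error = call_mmap(file, vma);   /* may replace vma->vm_file and fput(file) */
  if (error)
          goto unmap_and_free_vma;

  merge = try_merge_with_prev(mm, prev, vma);   /* hypothetical helper */
  if (merge) {
          /*
           * ->mmap() may have changed vma->vm_file, so drop the reference
           * held by the vma being discarded; the caller still drops its
           * own reference on the original 'file' later.
           */
          fput(vma->vm_file);
          vm_area_free(vma);
          vma = merge;
  }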

Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible")
Reported-by: syzbot+c5d5a51dcbb558ca0cb5@syzkaller.appspotmail.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christian König <ckoenig.leichtzumerken@gmail.com>
Cc: Hongxiang Lou <louhongxiang@huawei.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Link: https://lkml.kernel.org/r/20200916090733.31427-1-linmiaohe@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Alex Winkowski <dereference23@outlook.com>
hxsyzl pushed a commit that referenced this pull request Feb 5, 2025
We don't need to hold the local pinctrl lock here to set irq wake on the
summary irq line. Doing so only leads to lockdep warnings instead of
protecting us from anything. Remove the locking.
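
With the lock gone, the set_wake callback just forwards the request to the
summary interrupt; a sketch along the lines of the change (not necessarily the
exact driver code):

  static int msm_gpio_irq_set_wake(struct irq_data *d, unsigned int on)
  {
          struct gpio_chip *gc = irq_data_get_irq_chip_data(d);
          struct msm_pinctrl *pctrl = gpiochip_get_data(gc);

          /*
           * No pctrl->lock here: irq_set_irq_wake() serializes on the irq
           * descriptor lock itself, and nesting pctrl->lock around it only
           * created the pctrl->lock -> irq_desc_lock_class ordering that
           * lockdep complains about below.
           */
          irq_set_irq_wake(pctrl->irq, on);

          return 0;
  }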

 WARNING: possible circular locking dependency detected
 5.4.11 #2 Tainted: G        W
 ------------------------------------------------------
 cat/3083 is trying to acquire lock:
 ffffff81f4fa58c0 (&irq_desc_lock_class){-.-.}, at: __irq_get_desc_lock+0x64/0x94

 but task is already holding lock:
 ffffff81f4880c18 (&pctrl->lock){-.-.}, at: msm_gpio_irq_set_wake+0x48/0x7c

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&pctrl->lock){-.-.}:
        _raw_spin_lock_irqsave+0x64/0x80
        msm_gpio_irq_ack+0x68/0xf4
        __irq_do_set_handler+0xe0/0x180
        __irq_set_handler+0x60/0x9c
        irq_domain_set_info+0x90/0xb4
        gpiochip_hierarchy_irq_domain_alloc+0x110/0x200
        __irq_domain_alloc_irqs+0x130/0x29c
        irq_create_fwspec_mapping+0x1f0/0x300
        irq_create_of_mapping+0x70/0x98
        of_irq_get+0xa4/0xd4
        spi_drv_probe+0x4c/0xb0
        really_probe+0x138/0x3f0
        driver_probe_device+0x70/0x140
        __device_attach_driver+0x9c/0x110
        bus_for_each_drv+0x88/0xd0
        __device_attach+0xb0/0x160
        device_initial_probe+0x20/0x2c
        bus_probe_device+0x34/0x94
        device_add+0x35c/0x3f0
        spi_add_device+0xbc/0x194
        of_register_spi_devices+0x2c8/0x408
        spi_register_controller+0x57c/0x6fc
        spi_geni_probe+0x260/0x328
        platform_drv_probe+0x90/0xb0
        really_probe+0x138/0x3f0
        driver_probe_device+0x70/0x140
        device_driver_attach+0x4c/0x6c
        __driver_attach+0xcc/0x154
        bus_for_each_dev+0x84/0xcc
        driver_attach+0x2c/0x38
        bus_add_driver+0x108/0x1fc
        driver_register+0x64/0xf8
        __platform_driver_register+0x4c/0x58
        spi_geni_driver_init+0x1c/0x24
        do_one_initcall+0x1a4/0x3e8
        do_initcall_level+0xb4/0xcc
        do_basic_setup+0x30/0x48
        kernel_init_freeable+0x124/0x1a8
        kernel_init+0x14/0x100
        ret_from_fork+0x10/0x18

 -> #0 (&irq_desc_lock_class){-.-.}:
        __lock_acquire+0xeb4/0x2388
        lock_acquire+0x1cc/0x210
        _raw_spin_lock_irqsave+0x64/0x80
        __irq_get_desc_lock+0x64/0x94
        irq_set_irq_wake+0x40/0x144
        msm_gpio_irq_set_wake+0x5c/0x7c
        set_irq_wake_real+0x40/0x5c
        irq_set_irq_wake+0x70/0x144
        cros_ec_rtc_suspend+0x38/0x4c
        platform_pm_suspend+0x34/0x60
        dpm_run_callback+0x64/0xcc
        __device_suspend+0x310/0x41c
        dpm_suspend+0xf8/0x298
        dpm_suspend_start+0x84/0xb4
        suspend_devices_and_enter+0xbc/0x620
        pm_suspend+0x210/0x348
        state_store+0xb0/0x108
        kobj_attr_store+0x14/0x24
        sysfs_kf_write+0x4c/0x64
        kernfs_fop_write+0x15c/0x1fc
        __vfs_write+0x54/0x18c
        vfs_write+0xe4/0x1a4
        ksys_write+0x7c/0xe4
        __arm64_sys_write+0x20/0x2c
        el0_svc_common+0xa8/0x160
        el0_svc_handler+0x7c/0x98
        el0_svc+0x8/0xc

 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&pctrl->lock);
                                lock(&irq_desc_lock_class);
                                lock(&pctrl->lock);
   lock(&irq_desc_lock_class);

  *** DEADLOCK ***

 7 locks held by cat/3083:
  #0: ffffff81f06d1420 (sb_writers#7){.+.+}, at: vfs_write+0xd0/0x1a4
  #1: ffffff81c8935680 (&of->mutex){+.+.}, at: kernfs_fop_write+0x12c/0x1fc
  #2: ffffff81f4c322f0 (kn->count#337){.+.+}, at: kernfs_fop_write+0x134/0x1fc
  #3: ffffffe89a641d60 (system_transition_mutex){+.+.}, at: pm_suspend+0x108/0x348
  #4: ffffff81f190e970 (&dev->mutex){....}, at: __device_suspend+0x168/0x41c
  #5: ffffff81f183d8c0 (lock_class){-.-.}, at: __irq_get_desc_lock+0x64/0x94
  #6: ffffff81f4880c18 (&pctrl->lock){-.-.}, at: msm_gpio_irq_set_wake+0x48/0x7c

 stack backtrace:
 CPU: 4 PID: 3083 Comm: cat Tainted: G        W         5.4.11 #2
 Hardware name: Google Cheza (rev3+) (DT)
 Call trace:
  dump_backtrace+0x0/0x174
  show_stack+0x20/0x2c
  dump_stack+0xc8/0x124
  print_circular_bug+0x2ac/0x2c4
  check_noncircular+0x1a0/0x1a8
  __lock_acquire+0xeb4/0x2388
  lock_acquire+0x1cc/0x210
  _raw_spin_lock_irqsave+0x64/0x80
  __irq_get_desc_lock+0x64/0x94
  irq_set_irq_wake+0x40/0x144
  msm_gpio_irq_set_wake+0x5c/0x7c
  set_irq_wake_real+0x40/0x5c
  irq_set_irq_wake+0x70/0x144
  cros_ec_rtc_suspend+0x38/0x4c
  platform_pm_suspend+0x34/0x60
  dpm_run_callback+0x64/0xcc
  __device_suspend+0x310/0x41c
  dpm_suspend+0xf8/0x298
  dpm_suspend_start+0x84/0xb4
  suspend_devices_and_enter+0xbc/0x620
  pm_suspend+0x210/0x348
  state_store+0xb0/0x108
  kobj_attr_store+0x14/0x24
  sysfs_kf_write+0x4c/0x64
  kernfs_fop_write+0x15c/0x1fc
  __vfs_write+0x54/0x18c
  vfs_write+0xe4/0x1a4
  ksys_write+0x7c/0xe4
  __arm64_sys_write+0x20/0x2c
  el0_svc_common+0xa8/0x160
  el0_svc_handler+0x7c/0x98
  el0_svc+0x8/0xc

Fixes: 6aced33 ("pinctrl: msm: drop wake_irqs bitmap")
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Brian Masney <masneyb@onstation.org>
Cc: Lina Iyer <ilina@codeaurora.org>
Cc: Maulik Shah <mkshah@codeaurora.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/20200121180950.36959-1-swboyd@chromium.org
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Ahmad Thoriq Najahi <najahi@chips-projects.xyz>
hxsyzl pushed a commit that referenced this pull request Feb 5, 2025
When running rcutorture with TREE03 config, CONFIG_PROVE_LOCKING=y, and
kernel cmdline argument "rcutorture.gp_exp=1", lockdep reports a
HARDIRQ-safe->HARDIRQ-unsafe deadlock:

 ================================
 WARNING: inconsistent lock state
 4.16.0-rc4+ #1 Not tainted
 --------------------------------
 inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
 takes:
 __schedule+0xbe/0xaf0
 {IN-HARDIRQ-W} state was registered at:
   _raw_spin_lock+0x2a/0x40
   scheduler_tick+0x47/0xf0
...
 other info that might help us debug this:
  Possible unsafe locking scenario:
        CPU0
        ----
   lock(&rq->lock);
   <Interrupt>
     lock(&rq->lock);
  *** DEADLOCK ***
 1 lock held by rcu_torture_rea/724:
 rcu_torture_read_lock+0x0/0x70
 stack backtrace:
 CPU: 2 PID: 724 Comm: rcu_torture_rea Not tainted 4.16.0-rc4+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
 Call Trace:
  lock_acquire+0x90/0x200
  ? __schedule+0xbe/0xaf0
  _raw_spin_lock+0x2a/0x40
  ? __schedule+0xbe/0xaf0
  __schedule+0xbe/0xaf0
  preempt_schedule_irq+0x2f/0x60
  retint_kernel+0x1b/0x2d
 RIP: 0010:rcu_read_unlock_special+0x0/0x680
  ? rcu_torture_read_unlock+0x60/0x60
  __rcu_read_unlock+0x64/0x70
  rcu_torture_read_unlock+0x17/0x60
  rcu_torture_reader+0x275/0x450
  ? rcutorture_booster_init+0x110/0x110
  ? rcu_torture_stall+0x230/0x230
  ? kthread+0x10e/0x130
  kthread+0x10e/0x130
  ? kthread_create_worker_on_cpu+0x70/0x70
  ? call_usermodehelper_exec_async+0x11a/0x150
  ret_from_fork+0x3a/0x50

This happens with the following event sequence:

	preempt_schedule_irq();
	  local_irq_enable();
	  __schedule():
	    local_irq_disable(); // irq off
	    ...
	    rcu_note_context_switch():
	      rcu_note_preempt_context_switch():
	        rcu_read_unlock_special():
	          local_irq_save(flags);
	          ...
		  raw_spin_unlock_irqrestore(...,flags); // irq remains off
	          rt_mutex_futex_unlock():
	            raw_spin_lock_irq();
	            ...
	            raw_spin_unlock_irq(); // accidentally set irq on

	    <return to __schedule()>
	    rq_lock():
	      raw_spin_lock(); // acquiring rq->lock with irq on

which means rq->lock becomes a HARDIRQ-unsafe lock, which can cause
deadlocks in scheduler code.

This problem was introduced by commit 02a7c234e540 ("rcu: Suppress
lockdep false-positive ->boost_mtx complaints"), which introduced a caller of
rt_mutex_futex_unlock() with irqs off.

To fix this, replace the *lock_irq() in rt_mutex_futex_unlock() with
*lock_irq{save,restore}() to make it safe to call rt_mutex_futex_unlock()
with irq off.
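
Based on the description above, the fixed unlock path takes the shape below
(a sketch, assuming the usual wake-queue handling in kernel/locking/rtmutex.c):

  void __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
  {
          DEFINE_WAKE_Q(wake_q);
          unsigned long flags;
          bool postunlock;

          /*
           * irqsave/irqrestore instead of the plain _irq variants, so a
           * caller that already has interrupts disabled (as in the RCU
           * boost case above) gets its irq state preserved.
           */
          raw_spin_lock_irqsave(&lock->wait_lock, flags);
          postunlock = __rt_mutex_futex_unlock(lock, &wake_q);
          raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

          if (postunlock)
                  rt_mutex_postunlock(&wake_q);
  }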

Fixes: 02a7c234e540 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints")
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20180309065630.8283-1-boqun.feng@gmail.com

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
hxsyzl pushed a commit that referenced this pull request Feb 5, 2025
Patch series "mm: zap pages with read mmap_sem in munmap for large
mapping", v11.

Background:
Recently, when we ran some vm scalability tests on machines with large memory,
we ran into a couple of mmap_sem scalability issues when unmapping a large
memory space; please refer to https://lkml.org/lkml/2017/12/14/733 and
https://lkml.org/lkml/2018/2/20/576.

History:
Then akpm suggested unmapping the large mapping section by section, dropping
the mmap_sem in between, to mitigate it (see https://lkml.org/lkml/2018/3/6/784).

The v1 patch series was submitted to the mailing list per Andrew's suggestion
(see https://lkml.org/lkml/2018/3/20/786).  Then I received a lot of great
feedback and suggestions.

This topic was then discussed at the LSFMM summit 2018.  At the summit, Michal
Hocko suggested (also in the v1 patch review) trying a "two phases" approach:
zap pages with the read mmap_sem held, then re-acquire the write mmap_sem for
the cleanup (for discussion detail, see
https://lwn.net/Articles/753269/)

Approach:
Zapping pages is the most time-consuming part.  According to the suggestion from
Michal Hocko [1], zapping pages can be done while holding the read mmap_sem, like
what MADV_DONTNEED does.  The write mmap_sem is then re-acquired to clean up vmas.

But, we can't call MADV_DONTNEED directly, since it has two major drawbacks:
  * A page fault that wins the race in the middle of munmap sees an unexpected
    state: it may return a zero page instead of the content or SIGSEGV.
  * It can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings,
    which akpm considers a showstopper.

But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        deal with special mappings
        detach vmas
        downgrade_write

        zap pages
        free page tables
        release mmap_sem

VM events that take the read mmap_sem (i.e. page fault, gup, etc.) may come in
during page zapping, but since the vmas have already been detached, they will
not be able to find a valid vma and just return SIGSEGV or -EFAULT as expected.

Vmas with VM_HUGETLB | VM_PFNMAP are considered special mappings.  In this
patch they are handled by falling back to regular do_munmap() with the
exclusive mmap_sem held, since they may update vm flags.

But, with the "detach vmas first" approach, the vmas have already been detached
when vm flags are updated, so it sounds safe to update vm flags with the read
mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be handled
by the optimized path in the following separate patches, for the sake of
bisectability.

Unmapping uprobe areas may need to update mm flags (MMF_RECALC_UPROBES).
However, according to the uprobes developer, a false-positive
MMF_RECALC_UPROBES is fine.  So, uprobe unmap will not be handled by the
regular path.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.

And, since the lock acquire/release cost is kept to a minimum, almost the same
as before, the optimization can be extended to mappings of any size without
incurring a significant penalty for small mappings.

For the time being, just do this in the munmap syscall path.  Other
vm_munmap() or do_munmap() call sites (i.e. mmap, mremap, etc.) remain intact
due to implementation difficulties: they acquire the write mmap_sem from the
very beginning and hold it until the end, and do_munmap() might be called in
the middle.  But the optimized do_munmap() would like to be called without the
mmap_sem held so that the optimization can be applied.  So, if we want to do a
similar optimization for the mmap/mremap path, I'm afraid we would have to
redesign them.  mremap might be called on a very large area depending on the
use case; the optimization for it will be considered in the future.

This patch (of 3):

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

This is because munmap holds the mmap_sem exclusively from the very beginning
all the way to the end and doesn't release it in the middle.  Unmapping a
large mapping may take a long time (~18 seconds to unmap a 320GB mapping with
every single page mapped, on an idle machine).

Zapping pages is the most time-consuming part.  According to the suggestion
from Michal Hocko [1], zapping pages can be done while holding the read
mmap_sem, like what MADV_DONTNEED does.  The write mmap_sem is then
re-acquired to clean up vmas.

But, some part may need write mmap_sem, for example, vma splitting. So,
the design is as follows:
        acquire write mmap_sem
        lookup vmas (find and split vmas)
        deal with special mappings
        detach vmas
        downgrade_write

        zap pages
        free page tables
        release mmap_sem

VM events that take the read mmap_sem (i.e. page fault, gup, etc.) may come in
during page zapping, but since the vmas have already been detached, they will
not be able to find a valid vma and just return SIGSEGV or -EFAULT as expected.

Vmas with VM_HUGETLB | VM_PFNMAP are considered special mappings.  In this
patch they are handled without downgrading the mmap_sem, since they may update
vm flags.

But, with the "detach vmas first" approach, the vmas have already been detached
when vm flags are updated, so it sounds safe to update vm flags with the read
mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be handled
by the optimized path in the following separate patches, for the sake of
bisectability.

Unmapping uprobe areas may need to update mm flags (MMF_RECALC_UPROBES).
However, according to the uprobes developer, a false-positive
MMF_RECALC_UPROBES is fine.

With the "detach vmas first" approach we don't have to re-acquire mmap_sem
again to clean up vmas to avoid race window which might get the address
space changed since downgrade_write() doesn't release the lock to lead
regression, which simply downgrades to read lock.

And, since the lock acquire/release cost is kept to a minimum, almost the same
as before, the optimization can be extended to mappings of any size without
incurring a significant penalty for small mappings.

For the time being, just do this in the munmap syscall path.  Other
vm_munmap() or do_munmap() call sites (i.e. mmap, mremap, etc.) remain intact
due to implementation difficulties: they acquire the write mmap_sem from the
very beginning and hold it until the end, and do_munmap() might be called in
the middle.  But the optimized do_munmap() would like to be called without the
mmap_sem held so that the optimization can be applied.  So, if we want to do a
similar optimization for the mmap/mremap path, I'm afraid we would have to
redesign them.  mremap might be called on a very large area depending on the
use case; the optimization for it will be considered in the future.

With the patches, the exclusive mmap_sem hold time when munmapping an 80GB
address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped from
seconds to the microsecond level.

munmap_test-15002 [008]   594.380138: funcgraph_entry:              |  __vm_munmap() {
munmap_test-15002 [008]   594.380146: funcgraph_entry:  !2485684 us |    unmap_region();
munmap_test-15002 [008]   596.865836: funcgraph_exit:   !2485692 us |  }

Here the execution time of unmap_region() is used to evaluate the time spent
holding the read mmap_sem; the remaining time is spent holding the exclusive
lock.

[1] https://lwn.net/Articles/753269/

Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Alex Winkowski <dereference23@outlook.com>
hxsyzl pushed a commit that referenced this pull request Feb 5, 2025
The syzbot reported the below general protection fault:

  general protection fault, probably for non-canonical address
  0xe00eeaee0000003b: 0000 [#1] PREEMPT SMP KASAN
  KASAN: maybe wild-memory-access in range [0x00777770000001d8-0x00777770000001df]
  CPU: 1 PID: 10488 Comm: syz-executor721 Not tainted 5.9.0-rc3-syzkaller #0
  RIP: 0010:unlink_file_vma+0x57/0xb0 mm/mmap.c:164
  Call Trace:
     free_pgtables+0x1b3/0x2f0 mm/memory.c:415
     exit_mmap+0x2c0/0x530 mm/mmap.c:3184
     __mmput+0x122/0x470 kernel/fork.c:1076
     mmput+0x53/0x60 kernel/fork.c:1097
     exit_mm kernel/exit.c:483 [inline]
     do_exit+0xa8b/0x29f0 kernel/exit.c:793
     do_group_exit+0x125/0x310 kernel/exit.c:903
     get_signal+0x428/0x1f00 kernel/signal.c:2757
     arch_do_signal+0x82/0x2520 arch/x86/kernel/signal.c:811
     exit_to_user_mode_loop kernel/entry/common.c:136 [inline]
     exit_to_user_mode_prepare+0x1ae/0x200 kernel/entry/common.c:167
     syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:242
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

It's because the ->mmap() callback can change vma->vm_file and fput the
original file.  But the commit d70cec898324 ("mm: mmap: merge vma after
call_mmap() if possible") failed to catch this case and always fput()
the original file, hence add an extra fput().

[ Thanks Hillf for pointing this extra fput() out. ]

Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible")
Reported-by: syzbot+c5d5a51dcbb558ca0cb5@syzkaller.appspotmail.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christian König <ckoenig.leichtzumerken@gmail.com>
Cc: Hongxiang Lou <louhongxiang@huawei.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Link: https://lkml.kernel.org/r/20200916090733.31427-1-linmiaohe@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Alex Winkowski <dereference23@outlook.com>
hxsyzl pushed a commit that referenced this pull request Feb 6, 2025
=======================================================
[ INFO: possible circular locking dependency detected ]
3.0.0-rc3+ #26
-------------------------------------------------------
ip/1104 is trying to acquire lock:
 (local_softirq_lock){+.+...}, at: [<ffffffff81056d12>] __local_lock+0x25/0x68

but task is already holding lock:
 (sk_lock-AF_INET){+.+...}, at: [<ffffffff81433308>] lock_sock+0x10/0x12

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (sk_lock-AF_INET){+.+...}:
       [<ffffffff810836e5>] lock_acquire+0x103/0x12e
       [<ffffffff813e2781>] lock_sock_nested+0x82/0x92
       [<ffffffff81433308>] lock_sock+0x10/0x12
       [<ffffffff81433afa>] tcp_close+0x1b/0x355
       [<ffffffff81453c99>] inet_release+0xc3/0xcd
       [<ffffffff813dff3f>] sock_release+0x1f/0x74
       [<ffffffff813dffbb>] sock_close+0x27/0x2b
       [<ffffffff81129c63>] fput+0x11d/0x1e3
       [<ffffffff81126577>] filp_close+0x70/0x7b
       [<ffffffff8112667a>] sys_close+0xf8/0x13d
       [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b

-> #0 (local_softirq_lock){+.+...}:
       [<ffffffff81082ecc>] __lock_acquire+0xacc/0xdc8
       [<ffffffff810836e5>] lock_acquire+0x103/0x12e
       [<ffffffff814a7e40>] _raw_spin_lock+0x3b/0x4a
       [<ffffffff81056d12>] __local_lock+0x25/0x68
       [<ffffffff81056d8b>] local_bh_disable+0x36/0x3b
       [<ffffffff814a7fc4>] _raw_write_lock_bh+0x16/0x4f
       [<ffffffff81433c38>] tcp_close+0x159/0x355
       [<ffffffff81453c99>] inet_release+0xc3/0xcd
       [<ffffffff813dff3f>] sock_release+0x1f/0x74
       [<ffffffff813dffbb>] sock_close+0x27/0x2b
       [<ffffffff81129c63>] fput+0x11d/0x1e3
       [<ffffffff81126577>] filp_close+0x70/0x7b
       [<ffffffff8112667a>] sys_close+0xf8/0x13d
       [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sk_lock-AF_INET);
                               lock(local_softirq_lock);
                               lock(sk_lock-AF_INET);
  lock(local_softirq_lock);

 *** DEADLOCK ***

1 lock held by ip/1104:
 #0:  (sk_lock-AF_INET){+.+...}, at: [<ffffffff81433308>] lock_sock+0x10/0x12

stack backtrace:
Pid: 1104, comm: ip Not tainted 3.0.0-rc3+ #26
Call Trace:
 [<ffffffff81081649>] print_circular_bug+0x1f8/0x209
 [<ffffffff81082ecc>] __lock_acquire+0xacc/0xdc8
 [<ffffffff81056d12>] ? __local_lock+0x25/0x68
 [<ffffffff810836e5>] lock_acquire+0x103/0x12e
 [<ffffffff81056d12>] ? __local_lock+0x25/0x68
 [<ffffffff81046c75>] ? get_parent_ip+0x11/0x41
 [<ffffffff814a7e40>] _raw_spin_lock+0x3b/0x4a
 [<ffffffff81056d12>] ? __local_lock+0x25/0x68
 [<ffffffff81046c8c>] ? get_parent_ip+0x28/0x41
 [<ffffffff81056d12>] __local_lock+0x25/0x68
 [<ffffffff81056d8b>] local_bh_disable+0x36/0x3b
 [<ffffffff81433308>] ? lock_sock+0x10/0x12
 [<ffffffff814a7fc4>] _raw_write_lock_bh+0x16/0x4f
 [<ffffffff81433c38>] tcp_close+0x159/0x355
 [<ffffffff81453c99>] inet_release+0xc3/0xcd
 [<ffffffff813dff3f>] sock_release+0x1f/0x74
 [<ffffffff813dffbb>] sock_close+0x27/0x2b
 [<ffffffff81129c63>] fput+0x11d/0x1e3
 [<ffffffff81126577>] filp_close+0x70/0x7b
 [<ffffffff8112667a>] sys_close+0xf8/0x13d
 [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
hxsyzl pushed a commit that referenced this pull request Feb 6, 2025
There are actually two kinds of discard merge:

- one is the normal discard merge, just like a normal read/write request;
call it single-range discard

- the other is the multi-range discard, where queue_max_discard_segments(rq->q) > 1

For the former case, queue_max_discard_segments(rq->q) is 1, and we should
handle this kind of discard merge like a normal read/write request.
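
One way to express that distinction is a small helper which only treats a
discard as the special multi-range case when the queue supports more than one
discard segment (a sketch of the idea, not necessarily the exact block-layer
code):

  static inline bool blk_discard_mergable(struct request *req)
  {
          /*
           * Only multi-range discards need the special discard-merge
           * handling; a single-range discard is merged (and removed from
           * the elevator queue) exactly like a normal read/write request.
           */
          return req_op(req) == REQ_OP_DISCARD &&
                 queue_max_discard_segments(req->q) > 1;
  }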

This patch fixes the following kernel panic issue[1], which is caused by
not removing the single-range discard request from elevator queue.

Guangwu has one raid discard test case, in which this issue is a bit
easier to trigger, and I verified that this patch can fix the kernel
panic issue in Guangwu's test case.

[1] kernel panic log from Jens's report

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
 PGD 0 P4D 0.
 Oops: 0000 [#1] SMP PTI
 CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted \
4.20.0-rc3-00649-ge64d9a554a91-dirty #14  Hardware name: Wiwynn \
Leopard-Orv2/Leopard-DDR BW, BIOS LBM08   03/03/2017       Workqueue: kblockd \
blk_mq_run_work_fn                                            RIP: \
0010:blk_mq_get_driver_tag+0x81/0x120                                       Code: 24 \
10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 \
0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 \
f6 87 b0 00 00 00 02  RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246                     \
  RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8
 RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000
 RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000
 R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300
 R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000
 FS:  0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  blk_mq_dispatch_rq_list+0xec/0x480
  ? elv_rb_del+0x11/0x30
  blk_mq_do_dispatch_sched+0x6e/0xf0
  blk_mq_sched_dispatch_requests+0xfa/0x170
  __blk_mq_run_hw_queue+0x5f/0xe0
  process_one_work+0x154/0x350
  worker_thread+0x46/0x3c0
  kthread+0xf5/0x130
  ? process_one_work+0x350/0x350
  ? kthread_destroy_worker+0x50/0x50
  ret_from_fork+0x1f/0x30
 Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel \
kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii \
cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq \
button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme \
nvme_core fuse sg loop efivarfs autofs4  CR2: 0000000000000148                        \

 ---[ end trace 340a1fb996df1b9b ]---
 RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
 Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 \
00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 \
20 72 37 f6 87 b0 00 00 00 02

Fixes: 445251d0f4d329a ("blk-mq: fix discard merge with scheduler attached")
Reported-by: Jens Axboe <axboe@kernel.dk>
Cc: Guangwu Zhang <guazhang@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
hxsyzl pushed a commit that referenced this pull request Feb 8, 2025
Page replacement is handled in the Linux Kernel in one of two ways:

1) Asynchronously via kswapd
2) Synchronously, via direct reclaim

At page allocation time the allocating task is immediately given a page from
the zone free list, allowing it to go right back to work doing whatever it was
doing, probably directly or indirectly executing business logic.

Just prior to satisfying the allocation, the free page count is checked to see
if it has reached the zone low watermark and, if so, kswapd is awakened.
Kswapd will start scanning pages looking for inactive pages to evict to
make room for new page allocations. The work of kswapd allows tasks to
continue allocating memory from their respective zone free list without
incurring any delay.

When the demand for free pages exceeds the rate that kswapd tasks can
supply them, page allocation works differently. Once the allocating task
finds that the number of free pages is at or below the zone min watermark,
the task will no longer pull pages from the free list. Instead, the task
will run the same CPU-bound routines as kswapd to satisfy its own
allocation by scanning and evicting pages. This is called a direct reclaim.
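
In rough pseudo-C, the decision described above looks like this (illustrative
only; every helper name here is made up and not the actual mm/page_alloc.c
code):

  struct page *allocate_page(struct zone *zone)
  {
          /* Low watermark reached: wake the asynchronous reclaimer. */
          if (zone_free_pages(zone) <= low_watermark(zone))
                  wake_up_kswapd(zone);

          /* Still above the min watermark: take a page and keep going. */
          if (zone_free_pages(zone) > min_watermark(zone))
                  return take_from_free_list(zone);

          /*
           * Below the min watermark the allocating task reclaims for
           * itself, running the same scan/evict routines as kswapd before
           * retrying the free list.  This is the direct reclaim path.
           */
          direct_reclaim(zone);
          return take_from_free_list(zone);
  }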

The time spent performing a direct reclaim can be substantial, often
taking tens to hundreds of milliseconds for small order0 allocations to
half a second or more for order9 huge-page allocations. In fact, kswapd is
not actually required on a linux system. It exists for the sole purpose of
optimizing performance by preventing direct reclaims.

When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory allocating task can set the stage for collateral damage to occur in
small tasks that rarely allocate additional memory. Consider the impact of
injecting an additional 100ms of latency when nscd allocates memory to
facilitate caching of a DNS query.

The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the problem
go away. Since then hardware has evolved to bring a new struggle for
kswapd. Storage speeds have increased by orders of magnitude while CPU
clock speeds stayed the same or even slowed down in exchange for more
cores per package. This presents a throughput problem for a single
threaded kswapd that will get worse with each generation of new hardware.

Test Details

NOTE: The tests below were run with shadow entries disabled. See the
associated patch and cover letter for details

The tests below were designed with the assumption that a kswapd bottleneck
is best demonstrated using filesystem reads. This way, the inactive list
will be full of clean pages, simplifying the analysis and allowing kswapd
to achieve the highest possible steal rate. Maximum steal rates for kswapd
are likely to be the same or lower for any other mix of page types on the
system.

Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores,
756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has
an XFS file system mounted separately as /d0 through /d7. SSD drives
require multiple concurrent streams to show their potential, so I created
eleven 250GB zero-filled files on each drive so that I could test with
parallel reads.

The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. I did not include all of the
test output for brevity.

During each stage dd tasks are launched to read from each drive in a round
robin fashion until the specified number of tasks for the stage has been
reached. Then iostat, vmstat and top are started in the background with 10
second intervals. After five minutes, all of the dd tasks are killed and
the iostat, vmstat and top output is parsed in order to report the
following:

CPU consumption
- sy - aggregate kernel mode CPU consumption from vmstat output. The value
       doesn't tend to fluctuate much so I just grab the highest value.
       Each sample is averaged over 10 seconds
- dd_cpu - for all of the dd tasks averaged across the top samples since
           there is a lot of variation.

Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total

This first test performs reads using O_DIRECT in order to show the maximum
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks are increased.

The dd command for this test looks like this:

Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M

Test #1: Direct IO
dd sy dd_cpu throughput
6  0  2.33   14726026.40
10 1  2.95   19954974.80
16 1  2.63   24419689.30
22 1  2.63   25430303.20
28 1  2.91   26026513.20
34 1  2.53   26178618.00
40 1  2.18   26239229.20
46 1  1.91   26250550.40
52 1  1.69   26251845.60
58 1  1.54   26253205.60
64 1  1.43   26253780.80
70 1  1.31   26254154.80
76 1  1.21   26253660.80
82 1  1.12   26254214.80
88 1  1.07   26253770.00
90 1  1.04   26252406.40

Throughput was close to peak with only 22 dd tasks. Very little system CPU
was consumed as expected as the drives DMA directly into the user address
space when using direct IO.

In this next test, the iflag=direct option is removed and we only run the
test until the pgscan_kswapd from /proc/vmstat starts to increment. At
that point metrics are parsed and reported and the pagecache contents are
dropped prior to the next test. Lather, rinse, repeat.

Test #2: standard file system IO, no page replacement
dd sy dd_cpu throughput
6  2  28.78  5134316.40
10 3  31.40  8051218.40
16 5  34.73  11438106.80
22 7  33.65  14140596.40
28 8  31.24  16393455.20
34 10 29.88  1821946.60
40 11 28.33  19644159.60
46 11 25.05  20802497.60
52 13 26.92  22092370.00
58 13 23.29  22884881.20
64 14 23.12  23452248.80
70 15 22.40  23916468.00
76 16 22.06  24328737.20
82 17 20.97  24718693.20
88 16 18.57  25149404.40
90 16 18.31  25245565.60

Each read has to pause after the buffer in kernel space is populated while
those pages are added to the pagecache and copied into the user address
space. For this reason, more parallel streams are required to achieve peak
throughput. The copy operation consumes substantially more CPU than direct
IO as expected.

The next test measures throughput after kswapd starts running. This is the
same test only we wait for kswapd to wake up before we start collecting
metrics. The script actually keeps track of a few things that were not
mentioned earlier. It tracks direct reclaims and page scans by watching
the metrics in /proc/vmstat. CPU consumption for kswapd is tracked the
same way it is tracked for dd.

Since the test is 100% reads, you can assume that the page steal rate for
kswapd and direct reclaims is almost identical to the scan rate.

Test #3: 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput  dr    pgscan_kswapd pgscan_direct
10 4  26.07  28.56   27.03   7355924.40  0     459316976     0
16 7  34.94  69.33   69.66   10867895.20 0     872661643     0
22 10 36.03  93.99   99.33   13130613.60 489   1037654473    11268334
28 10 30.34  95.90   98.60   14601509.60 671   1182591373    15429142
34 14 34.77  97.50   99.23   16468012.00 10850 1069005644    249839515
40 17 36.32  91.49   97.11   17335987.60 18903 975417728     434467710
46 19 38.40  90.54   91.61   17705394.40 25369 855737040     582427973
52 22 40.88  83.97   83.70   17607680.40 31250 709532935     724282458
58 25 40.89  82.19   80.14   17976905.60 35060 657796473     804117540
64 28 41.77  73.49   75.20   18001910.00 39073 561813658     895289337
70 33 45.51  63.78   64.39   17061897.20 44523 379465571     1020726436
76 36 46.95  57.96   60.32   16964459.60 47717 291299464     1093172384
82 39 47.16  55.43   56.16   16949956.00 49479 247071062     1134163008
88 42 47.41  53.75   47.62   16930911.20 51521 195449924     1180442208
90 43 47.18  51.40   50.59   16864428.00 51618 190758156     1183203901

In the previous test where kswapd was not involved, the system-wide kernel
mode CPU consumption with 90 dd tasks was 16%. In this test CPU consumption
with 90 tasks is at 43%. With 52 cores, and two kswapd tasks (one per NUMA
node), kswapd can only be responsible for a little over 4% of the increase.
The rest is likely caused by 51,618 direct reclaims that scanned 1.2
billion pages over the five minute time period of the test.

Same test, more kswapd tasks:

Test #4: 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput  dr    pgscan_kswapd pgscan_direct
10 5  27.09  16.65   14.17   7842605.60  0     459105291     0
16 10 37.12  26.02   24.85   11352920.40 15    920527796     358515
22 11 36.94  37.13   35.82   13771869.60 0     1132169011     0
28 13 35.23  48.43   46.86   16089746.00 0     1312902070     0
34 15 33.37  53.02   55.69   18314856.40 0     1476169080     0
40 19 35.90  69.60   64.41   19836126.80 0     1629999149     0
46 22 36.82  88.55   57.20   20740216.40 0     1708478106     0
52 24 34.38  93.76   68.34   21758352.00 0     1794055559     0
58 24 30.51  79.20   82.33   22735594.00 0     1872794397     0
64 26 30.21  97.12   76.73   23302203.60 176   1916593721     4206821
70 33 32.92  92.91   92.87   23776588.00 3575  1817685086     85574159
76 37 31.62  91.20   89.83   24308196.80 4752  1812262569     113981763
82 29 25.53  93.23   92.33   24802791.20 306   2032093122     7350704
88 43 37.12  76.18   77.01   25145694.40 20310 1253204719     487048202
90 42 38.56  73.90   74.57   22516787.60 22774 1193637495     545463615

By increasing the number of kswapd threads, throughput increased by ~50%
while kernel mode CPU utilization decreased or stayed the same, likely due
to a decrease in the number of parallel tasks at any given time doing page
replacement.

Change-Id: I966d4a9c33bad188b3409f7ceea1df205a63c3bd
Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Patch-mainline: linux-mm @ Mon,  2 Apr 2018 09:24:22
Link: https://lore.kernel.org/lkml/1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com
[charante@codeaurora.org]: Changes done to ensure QGKI compliance.
Signed-off-by: Charan Teja Kalla <charante@codeaurora.org>
hxsyzl pushed a commit that referenced this pull request Feb 8, 2025
This reverts commit 9dcdb6f.

The IRQ subsystem already blocks suspend while waiting for IRQ threads to
finish running (in dpm_noirq_begin()). This PM wakeup does nothing but add
latency to the IRQ handler for non-RT kernels, and it isn't RT-friendly
either:
[   42.466403] BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:974
[   42.466407] in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/3
[   42.466408] Preemption disabled at:
[   42.466421] [<00000000100c9f7d>] secondary_start_kernel+0xa8/0x130
[   42.466427] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G S      W       4.14.212-rt102-Sultan #1
[   42.466429] Hardware name: Qualcomm Technologies, Inc. SM8150 V2 PM8150 Google Inc. MSM sm8150 Coral (DT)
[   42.466432] Call trace:
[   42.466436]  dump_backtrace+0x0/0x1ac
[   42.466439]  show_stack+0x14/0x1c
[   42.466444]  dump_stack+0x84/0xac
[   42.466448]  ___might_sleep+0x140/0x150
[   42.466452]  rt_spin_lock+0x3c/0x50
[   42.466458]  __pm_stay_awake+0x20/0x50
[   42.466462]  qcom_smp2p_isr+0x10/0x1c
[   42.466467]  __handle_irq_event_percpu+0x60/0xd4
[   42.466469]  handle_irq_event_percpu+0x58/0xb0
[   42.466471]  handle_irq_event+0x68/0xe0
[   42.466474]  handle_fasteoi_irq+0x140/0x1fc
[   42.466476]  generic_handle_irq+0x18/0x2c
[   42.466478]  __handle_domain_irq+0xf8/0xfc
[   42.466481]  gic_handle_irq+0xc8/0x164
[   42.466483]  el1_irq+0xb0/0x130
[   42.466487]  finish_task_switch+0xcc/0x1e4
[   42.466491]  __schedule+0x3f0/0x4e0
[   42.466493]  schedule_idle+0x28/0x44
[   42.466497]  do_idle+0x78/0x230
[   42.466500]  cpu_startup_entry+0x20/0x28
[   42.466502]  secondary_start_kernel+0x124/0x130

Remove it since it's useless.
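
After the revert the hard-irq handler simply defers to its threaded handler;
roughly (a sketch, assuming the driver's usual hardirq/thread split):

  static irqreturn_t qcom_smp2p_isr(int irq, void *data)
  {
          /*
           * No __pm_stay_awake() here: the IRQ core already keeps the
           * system from suspending while the threaded handler is pending
           * or running, so the wakeup source only added latency (and an
           * RT-unsafe sleeping lock in hard-irq context, see the splat
           * above).
           */
          return IRQ_WAKE_THREAD;
  }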

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Zlatan Radovanovic <zlatan.radovanovic@fet.ba>
hxsyzl pushed a commit that referenced this pull request Feb 8, 2025
We observed a kernel crash when a FAT device is removed.

[111565.359333] c5    692 Unable to handle kernel NULL pointer dereference at virtual address 000000c8
[111565.359341] c5    692 Mem abort info:
[111565.359344] c5    692   Exception class = DABT (current EL), IL = 32 bits
[111565.359346] c5    692   SET = 0, FnV = 0
[111565.359348] c5    692   EA = 0, S1PTW = 0
[111565.359350] c5    692   FSC = 5
[111565.359352] c5    692 Data abort info:
[111565.359354] c5    692   ISV = 0, ISS = 0x00000005
[111565.359356] c5    692   CM = 0, WnR = 0
[111565.359359] c5    692 user pgtable: 4k pages, 39-bit VAs, pgd = 0000000011af7d86
[111565.359362] c5    692 [00000000000000c8] *pgd=0000000000000000, *pud=0000000000000000
[111565.359366] c5    692 Internal error: Oops: 96000005 [#1] PREEMPT SMP
[111565.359404] c5    692 task: 000000002bd545ae task.stack: 0000000067db34ef
[111565.359414] c5    692 pc : percpu_counter_add_batch+0x20/0x230
[111565.359421] c5    692 lr : generic_make_request_checks+0x50c/0x924
...
[111565.359535] c5    692 Call trace:
[111565.359538] c5    692  percpu_counter_add_batch+0x20/0x230
[111565.359541] c5    692  generic_make_request_checks+0x50c/0x924
[111565.359543] c5    692  generic_make_request+0x40/0x298
[111565.359545] c5    692  submit_bio+0xb0/0x5bc
[111565.359550] c5    692  submit_bh_wbc+0x14c/0x194
[111565.359552] c5    692  __bread_gfp+0x110/0x240
[111565.359556] c5    692  fat_set_state+0x68/0x144
[111565.359558] c5    692  fat_put_super+0x20/0x68
[111565.359562] c5    692  generic_shutdown_super+0x88/0x2a0
[111565.359564] c5    692  kill_block_super+0x20/0x58
[111565.359566] c5    692  deactivate_locked_super+0xc8/0x34c
[111565.359569] c5    692  cleanup_mnt+0x15c/0x324
[111565.359572] c5    692  __cleanup_mnt+0x14/0x20
[111565.359576] c5    692  task_work_run+0x15c/0x1b0
[111565.359579] c5    692  do_notify_resume+0xf94/0x1148
[111565.359583] c5    692  work_pending+0x8/0x10

Bug: 179494045
Test: manually remove OTG storage
Change-Id: Ic9e9ff2b1ea70307cbf1864cd7475481c04959cb
Signed-off-by: Randall Huang <huangrandall@google.com>
hxsyzl pushed a commit that referenced this pull request Feb 9, 2025
Page replacement is handled in the Linux Kernel in one of two ways:

1) Asynchronously via kswapd
2) Synchronously, via direct reclaim

At page allocation time the allocating task is immediately given a page from
the zone free list, allowing it to go right back to work doing whatever it was
doing, probably directly or indirectly executing business logic.

Just prior to satisfying the allocation, the free page count is checked to see
if it has reached the zone low watermark and, if so, kswapd is awakened.
Kswapd will start scanning pages looking for inactive pages to evict to
make room for new page allocations. The work of kswapd allows tasks to
continue allocating memory from their respective zone free list without
incurring any delay.

When the demand for free pages exceeds the rate that kswapd tasks can
supply them, page allocation works differently. Once the allocating task
finds that the number of free pages is at or below the zone min watermark,
the task will no longer pull pages from the free list. Instead, the task
will run the same CPU-bound routines as kswapd to satisfy its own
allocation by scanning and evicting pages. This is called a direct reclaim.

The time spent performing a direct reclaim can be substantial, often
taking tens to hundreds of milliseconds for small order0 allocations to
half a second or more for order9 huge-page allocations. In fact, kswapd is
not actually required on a linux system. It exists for the sole purpose of
optimizing performance by preventing direct reclaims.

When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory allocating task can set the stage for collateral damage to occur in
small tasks that rarely allocate additional memory. Consider the impact of
injecting an additional 100ms of latency when nscd allocates memory to
facilitate caching of a DNS query.

The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the problem
go away. Since then hardware has evolved to bring a new struggle for
kswapd. Storage speeds have increased by orders of magnitude while CPU
clock speeds stayed the same or even slowed down in exchange for more
cores per package. This presents a throughput problem for a single
threaded kswapd that will get worse with each generation of new hardware.

Test Details

NOTE: The tests below were run with shadow entries disabled. See the
associated patch and cover letter for details

The tests below were designed with the assumption that a kswapd bottleneck
is best demonstrated using filesystem reads. This way, the inactive list
will be full of clean pages, simplifying the analysis and allowing kswapd
to achieve the highest possible steal rate. Maximum steal rates for kswapd
are likely to be the same or lower for any other mix of page types on the
system.

Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores,
756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has
an XFS file system mounted separately as /d0 through /d7. SSD drives
require multiple concurrent streams to show their potential, so I created
eleven 250GB zero-filled files on each drive so that I could test with
parallel reads.

The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. I did not include all of the
test output for brevity.

During each stage dd tasks are launched to read from each drive in a round
robin fashion until the specified number of tasks for the stage has been
reached. Then iostat, vmstat and top are started in the background with 10
second intervals. After five minutes, all of the dd tasks are killed and
the iostat, vmstat and top output is parsed in order to report the
following:

CPU consumption
- sy - aggregate kernel mode CPU consumption from vmstat output. The value
       doesn't tend to fluctuate much so I just grab the highest value.
       Each sample is averaged over 10 seconds
- dd_cpu - for all of the dd tasks averaged across the top samples since
           there is a lot of variation.

Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total

This first test performs reads using O_DIRECT in order to show the maximum
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks are increased.

The dd command for this test looks like this:

Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M

Test #1: Direct IO
dd sy dd_cpu throughput
6  0  2.33   14726026.40
10 1  2.95   19954974.80
16 1  2.63   24419689.30
22 1  2.63   25430303.20
28 1  2.91   26026513.20
34 1  2.53   26178618.00
40 1  2.18   26239229.20
46 1  1.91   26250550.40
52 1  1.69   26251845.60
58 1  1.54   26253205.60
64 1  1.43   26253780.80
70 1  1.31   26254154.80
76 1  1.21   26253660.80
82 1  1.12   26254214.80
88 1  1.07   26253770.00
90 1  1.04   26252406.40

Throughput was close to peak with only 22 dd tasks. Very little system CPU
was consumed as expected as the drives DMA directly into the user address
space when using direct IO.

In this next test, the iflag=direct option is removed and we only run the
test until the pgscan_kswapd from /proc/vmstat starts to increment. At
that point metrics are parsed and reported and the pagecache contents are
dropped prior to the next test. Lather, rinse, repeat.

Test #2: standard file system IO, no page replacement
dd sy dd_cpu throughput
6  2  28.78  5134316.40
10 3  31.40  8051218.40
16 5  34.73  11438106.80
22 7  33.65  14140596.40
28 8  31.24  16393455.20
34 10 29.88  1821946.60
40 11 28.33  19644159.60
46 11 25.05  20802497.60
52 13 26.92  22092370.00
58 13 23.29  22884881.20
64 14 23.12  23452248.80
70 15 22.40  23916468.00
76 16 22.06  24328737.20
82 17 20.97  24718693.20
88 16 18.57  25149404.40
90 16 18.31  25245565.60

Each read has to pause after the buffer in kernel space is populated while
those pages are added to the pagecache and copied into the user address
space. For this reason, more parallel streams are required to achieve peak
throughput. The copy operation consumes substantially more CPU than direct
IO as expected.

The next test measures throughput after kswapd starts running. This is the
same test, except that we wait for kswapd to wake up before we start
collecting metrics. The script also keeps track of a few things that were
not mentioned earlier: it tracks direct reclaims and page scans by watching
the counters in /proc/vmstat, and CPU consumption for kswapd is tracked the
same way it is tracked for dd.

Since the test is 100% reads, you can assume that the page steal rate for
kswapd and direct reclaims is almost identical to the scan rate.
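
The dr, pgscan_kswapd and pgscan_direct columns below can be derived by
diffing /proc/vmstat snapshots taken at the start and end of a stage, for
example (an illustrative sketch, not the original script):

  # Snapshot the relevant counters, wait out the stage, print the deltas.
  # The allocstall* counters approximate the number of direct reclaims.
  snap() { awk '/^(allocstall|pgscan_kswapd|pgscan_direct)/ { print $1, $2 }' /proc/vmstat; }

  snap > before.txt
  sleep 300                              # five-minute measurement window
  snap > after.txt

  paste before.txt after.txt | awk '{ printf "%-24s %d\n", $1, $4 - $2 }'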

Test #3: 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput  dr    pgscan_kswapd pgscan_direct
10 4  26.07  28.56   27.03   7355924.40  0     459316976     0
16 7  34.94  69.33   69.66   10867895.20 0     872661643     0
22 10 36.03  93.99   99.33   13130613.60 489   1037654473    11268334
28 10 30.34  95.90   98.60   14601509.60 671   1182591373    15429142
34 14 34.77  97.50   99.23   16468012.00 10850 1069005644    249839515
40 17 36.32  91.49   97.11   17335987.60 18903 975417728     434467710
46 19 38.40  90.54   91.61   17705394.40 25369 855737040     582427973
52 22 40.88  83.97   83.70   17607680.40 31250 709532935     724282458
58 25 40.89  82.19   80.14   17976905.60 35060 657796473     804117540
64 28 41.77  73.49   75.20   18001910.00 39073 561813658     895289337
70 33 45.51  63.78   64.39   17061897.20 44523 379465571     1020726436
76 36 46.95  57.96   60.32   16964459.60 47717 291299464     1093172384
82 39 47.16  55.43   56.16   16949956.00 49479 247071062     1134163008
88 42 47.41  53.75   47.62   16930911.20 51521 195449924     1180442208
90 43 47.18  51.40   50.59   16864428.00 51618 190758156     1183203901

In the previous test where kswapd was not involved, the system-wide kernel
mode CPU consumption with 90 dd tasks was 16%. In this test CPU consumption
with 90 tasks is at 43%. With 52 cores, and two kswapd tasks (one per NUMA
node), kswapd can only be responsible for a little over 4% of the increase.
The rest is likely caused by 51,618 direct reclaims that scanned 1.2
billion pages over the five minute time period of the test.

Same test, more kswapd tasks:

Test #4: 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput  dr    pgscan_kswapd pgscan_direct
10 5  27.09  16.65   14.17   7842605.60  0     459105291     0
16 10 37.12  26.02   24.85   11352920.40 15    920527796     358515
22 11 36.94  37.13   35.82   13771869.60 0     1132169011     0
28 13 35.23  48.43   46.86   16089746.00 0     1312902070     0
34 15 33.37  53.02   55.69   18314856.40 0     1476169080     0
40 19 35.90  69.60   64.41   19836126.80 0     1629999149     0
46 22 36.82  88.55   57.20   20740216.40 0     1708478106     0
52 24 34.38  93.76   68.34   21758352.00 0     1794055559     0
58 24 30.51  79.20   82.33   22735594.00 0     1872794397     0
64 26 30.21  97.12   76.73   23302203.60 176   1916593721     4206821
70 33 32.92  92.91   92.87   23776588.00 3575  1817685086     85574159
76 37 31.62  91.20   89.83   24308196.80 4752  1812262569     113981763
82 29 25.53  93.23   92.33   24802791.20 306   2032093122     7350704
88 43 37.12  76.18   77.01   25145694.40 20310 1253204719     487048202
90 42 38.56  73.90   74.57   22516787.60 22774 1193637495     545463615

Increasing the number of kswapd threads improved throughput by ~50% while
kernel mode CPU utilization decreased or stayed the same, likely due to a
decrease in the number of tasks doing page replacement in parallel at any
given time.

Change-Id: I966d4a9c33bad188b3409f7ceea1df205a63c3bd
Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Patch-mainline: linux-mm @ Mon,  2 Apr 2018 09:24:22
Link: https://lore.kernel.org/lkml/1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com
[charante@codeaurora.org]: Changes done to ensure QGKI compliance.
Signed-off-by: Charan Teja Kalla <charante@codeaurora.org>
hxsyzl pushed a commit that referenced this pull request Feb 9, 2025
This reverts commit 9dcdb6f.

The IRQ subsystem already blocks suspend while waiting for IRQ threads to
finish running (in dpm_noirq_begin()). This PM wakeup does nothing but add
latency to the IRQ handler for non-RT kernels, and it isn't RT-friendly
either:
either:
[   42.466403] BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:974
[   42.466407] in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/3
[   42.466408] Preemption disabled at:
[   42.466421] [<00000000100c9f7d>] secondary_start_kernel+0xa8/0x130
[   42.466427] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G S      W       4.14.212-rt102-Sultan #1
[   42.466429] Hardware name: Qualcomm Technologies, Inc. SM8150 V2 PM8150 Google Inc. MSM sm8150 Coral (DT)
[   42.466432] Call trace:
[   42.466436]  dump_backtrace+0x0/0x1ac
[   42.466439]  show_stack+0x14/0x1c
[   42.466444]  dump_stack+0x84/0xac
[   42.466448]  ___might_sleep+0x140/0x150
[   42.466452]  rt_spin_lock+0x3c/0x50
[   42.466458]  __pm_stay_awake+0x20/0x50
[   42.466462]  qcom_smp2p_isr+0x10/0x1c
[   42.466467]  __handle_irq_event_percpu+0x60/0xd4
[   42.466469]  handle_irq_event_percpu+0x58/0xb0
[   42.466471]  handle_irq_event+0x68/0xe0
[   42.466474]  handle_fasteoi_irq+0x140/0x1fc
[   42.466476]  generic_handle_irq+0x18/0x2c
[   42.466478]  __handle_domain_irq+0xf8/0xfc
[   42.466481]  gic_handle_irq+0xc8/0x164
[   42.466483]  el1_irq+0xb0/0x130
[   42.466487]  finish_task_switch+0xcc/0x1e4
[   42.466491]  __schedule+0x3f0/0x4e0
[   42.466493]  schedule_idle+0x28/0x44
[   42.466497]  do_idle+0x78/0x230
[   42.466500]  cpu_startup_entry+0x20/0x28
[   42.466502]  secondary_start_kernel+0x124/0x130

Remove it since it's useless.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Zlatan Radovanovic <zlatan.radovanovic@fet.ba>
hxsyzl pushed a commit that referenced this pull request Feb 11, 2025
There were many order-3 allocation failure reports while the VM had lots of
*reclaimable* memory.

[17353.434071] kworker/u16:4 invoked oom-killer: gfp_mask=0x6160c0(GFP_KERNEL|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_MEMALLOC), nodemask=(null), order=3, oom_score_adj=0
[17353.434079] kworker/u16:4 cpuset=/ mems_allowed=0
[17353.434086] CPU: 6 PID: 30045 Comm: kworker/u16:4 Tainted: G S      WC O      4.19.95-g8137b6ce669e-ab6554412 #1
[17353.434089] Hardware name: Google Inc. MSM sm7250 v2 Bramble DVT (DT)
[17353.434194] Workqueue: iparepwq95 __typeid__ZTSFiP44ipa_disable_force_clear_datapath_req_msg_v01E_global_addr [ipa3]
[17353.434197] Call trace:
[17353.434206] __typeid__ZTSFjP11task_structPK11user_regsetE_global_addr+0x14/0x18
[17353.434210] dump_stack+0xbc/0xf8
[17353.434217] dump_header+0xc8/0x250
[17353.434220] oom_kill_process+0x130/0x538
[17353.434222] out_of_memory+0x320/0x444
[17353.434226] __alloc_pages_nodemask+0x1124/0x13b4
[17353.434314] ipa3_alloc_rx_pkt_page+0x64/0x1a8 [ipa3]
[17353.434403] ipa3_wq_page_repl+0x78/0x1a4 [ipa3]
[17353.434407] process_one_work+0x3a8/0x6e4
[17353.434410] worker_thread+0x394/0x820
[17353.434413] kthread+0x19c/0x1ac
[17353.434417] ret_from_fork+0x10/0x18
[17353.434419] Mem-Info:
[17353.434424] active_anon:357378 inactive_anon:119141 isolated_anon:13\x0a active_file:97495 inactive_file:122151 isolated_file:22\x0a unevictable:49750 dirty:3553 writeback:0 unstable:0\x0a slab_reclaimable:30018 slab_unreclaimable:73884\x0a mapped:259586 shmem:27580 pagetables:39581 bounce:0\x0a free:17710 free_pcp:301 free_cma:0
[17353.434433] Node 0 active_anon:1429512kB inactive_anon:476564kB active_file:389980kB inactive_file:488604kB unevictable:199000kB isolated(anon):52kB isolated(file):88kB mapped:1038344kB dirty:14212kB writeback:0kB shmem:110320kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[17353.434439] Normal free:70840kB min:9172kB low:43900kB high:49484kB active_anon:1429284kB inactive_anon:476336kB active_file:389980kB inactive_file:488604kB unevictable:199000kB writepending:14212kB present:5764280kB managed:5584928kB mlocked:199000kB kernel_stack:92656kB shadow_call_stack:5792kB pagetables:158324kB bounce:0kB free_pcp:1204kB local_pcp:108kB free_cma:0kB
[17353.434441] lowmem_reserve[]: 0 0
[17353.434444] Normal: 8956*4kB (UMEH) 2726*8kB (UH) 751*16kB (UH) 33*32kB (H) 7*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 71152kB
[17353.434451] 300317 total pagecache pages
[17353.434454] 4228 pages in swap cache
[17353.434456] Swap cache stats: add 20710158, delete 20707317, find 1014864/9891370
[17353.434459] Free swap  = 103732kB
[17353.434460] Total swap = 2097148kB
[17353.434462] 1441070 pages RAM
[17353.434465] 0 pages HighMem/MovableOnly
[17353.434466] 44838 pages reserved
[17353.434469] 73728 pages cma reserved

When we look at the trace, compaction finished with COMPACT_COMPLETE (iow, it
had already fully scanned the zone but failed to create an order-3 allocation),
so should_compact_retry returns "false".

           <...>-30045 [006] .... 17353.433704: reclaim_retry_zone: node=0 zone=Normal   order=3 reclaimable=696132 available=713920 min_wmark=2293 no_progress_loops=0 wmark_check=0
           <...>-30045 [006] .... 17353.433706: compact_retry: order=3 priority=COMPACT_PRIO_SYNC_FULL compaction_result=failed retries=0 max_retries=16 should_retry=0

From the trace we can see that compaction had a hard time finding free pages
in the zone, so compaction's free scanner moved quickly toward the migration
scanner until the two (migration scanner and free page scanner) finally
crossed over.

           <...>-30045 [006] .... 17353.427026: mm_compaction_isolate_freepages: range=(0x144c00 ~ 0x145000) nr_scanned=784 nr_taken=0
           <...>-30045 [006] .... 17353.427037: mm_compaction_isolate_freepages: range=(0x144800 ~ 0x144c00) nr_scanned=1019 nr_taken=0
           <...>-30045 [006] .... 17353.427049: mm_compaction_isolate_freepages: range=(0x144400 ~ 0x144800) nr_scanned=880 nr_taken=1
           <...>-30045 [006] .... 17353.427061: mm_compaction_isolate_freepages: range=(0x144000 ~ 0x144400) nr_scanned=869 nr_taken=0
           <...>-30045 [006] .... 17353.427212: mm_compaction_isolate_freepages: range=(0x140c00 ~ 0x141000) nr_scanned=1016 nr_taken=0
..
..
           <...>-30045 [006] .... 17353.433696: mm_compaction_finished: node=0 zone=Normal   order=3 ret=complete
           <...>-30045 [006] .... 17353.433698: mm_compaction_end: zone_start=0x80600 migrate_pfn=0xc9400 free_pfn=0xc9500 zone_end=0x200000, mode=sync status=complete

Looking at the reclaim activity in the trace, we can see that
it was not hard to reclaim memory.

           <...>-30045 [006] .... 17353.413941: mm_vmscan_direct_reclaim_begin: order=3 may_writepage=1 gfp_flags=GFP_KERNEL|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_MEMALLOC classzone_idx=0
           <...>-30045 [006] d..1 17353.413946: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=8 nr_scanned=8 nr_skipped=0 nr_taken=8 lru=inactive_anon
           <...>-30045 [006] .... 17353.413958: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=8 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=8 nr_ref_keep=0 nr_unmap_fail=0 priority=12 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.413960: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119119 inactive=119119 total_active=357352 active=357352 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] d..1 17353.413965: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=22 nr_scanned=22 nr_skipped=0 nr_taken=22 lru=inactive_file
           <...>-30045 [006] .... 17353.413979: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=22 nr_reclaimed=22 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=12 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.413979: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=122195 inactive=122195 total_active=97508 active=97508 ratio=1 flags=RECLAIM_WB_FILE
           <...>-30045 [006] .... 17353.413980: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119119 inactive=119119 total_active=357352 active=357352 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414134: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414135: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] d..1 17353.414141: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=29 nr_scanned=29 nr_skipped=0 nr_taken=29 lru=inactive_anon
           <...>-30045 [006] .... 17353.414170: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=29 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=29 nr_ref_keep=0 nr_unmap_fail=0 priority=10 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.414170: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119107 inactive=119107 total_active=357385 active=357385 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] d..1 17353.414176: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=active_anon
           <...>-30045 [006] .... 17353.414206: mm_vmscan_lru_shrink_active: nid=0 nr_taken=32 nr_active=0 nr_deactivated=32 nr_referenced=32 priority=10 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
           <...>-30045 [006] d..1 17353.414212: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=inactive_file
           <...>-30045 [006] .... 17353.414225: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=10 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.414225: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=122131 inactive=122131 total_active=97508 active=97508 ratio=1 flags=RECLAIM_WB_FILE
           <...>-30045 [006] d..1 17353.414228: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=16 nr_scanned=16 nr_skipped=0 nr_taken=16 lru=inactive_file
           <...>-30045 [006] .... 17353.414235: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=16 nr_reclaimed=16 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=10 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.414235: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=122115 inactive=122115 total_active=97508 active=97508 ratio=1 flags=RECLAIM_WB_FILE
           <...>-30045 [006] .... 17353.414236: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119139 inactive=119139 total_active=357353 active=357353 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414320: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414321: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414339: mm_vmscan_direct_reclaim_end: nr_reclaimed=70

Based on that, we can assume that if the reclaimer had reclaimed more pages,
compaction could have found free pages more easily and its free scanner would
not have moved so fast. In that case the non-costly high-order allocation
would not have failed.

What this patch does: for a non-costly high-order allocation, it keeps
retrying migration together with reclaim as long as the system still has
enough reclaimable memory.

Bug: 156785617
Bug: 158449887
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ic02146be8acc4334b51be6cea54411432547608d
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
hxsyzl pushed a commit that referenced this pull request Feb 11, 2025
…spend()

Add an IRQ state check before enabling the swrm interrupt.
This makes sure the interrupt is only enabled when it is currently in the
disabled state.

This Fixes:
W         : ------------[ cut here ]------------
W         : Unbalanced enable for IRQ 502
W         : WARNING: CPU: 4 PID: 81 at kernel/irq/manage.c:621 enable_irq+0x98/0xf0
I         : Modules linked in:
I         : CPU: 4 PID: 81 Comm: kworker/4:1 Tainted: G S                4.19.197-IMMENSITY-g09a924c384cb #1
I Hardware name: Qualcomm Technologies, Inc. xiaomi alioth (DT)
I Workqueue: pm pm_runtime_work
I pstate  : 60c00085 (nZCv daIf +PAN +UAO)
I pc      : enable_irq+0x98/0xf0
I lr      : enable_irq+0x98/0xf0
I sp      : ffffff800885bbb0
I         : x29: ffffff800885bbc0 x28: ffffffa6fea0db38
I         : x27: 0000000000000002 x26: 0000000000000000
I         : x25: ffffffd1991ef37d x24: 0000000000000000
I         : x23: ffffffd1991edca1 x22: ffffffd199555410
I         : x21: ffffffd1991ef248 x20: 00000000000001f6
I         : x19: ffffffd18cc7b400 x18: ffffffd1b4e9f048
I         : x17: 0000000000000000 x16: 0000000000000000
I         : x15: 0000000000000086 x14: 0000000000000030
I         : x13: 0000000000049754 x12: 0000000000000000
I         : x11: 0000000000000000 x10: 0000000000000007
I         : x9 : 060ca0f25e42ae00 x8 : 060ca0f25e42ae00
I         : x7 : 0000000000000000 x6 : ffffffa6fed3f8e5
I         : x5 : 00000000001b68dc x4 : 000000000000000e
I         : x3 : 0000000000000032 x2 : 0000000000000007
I         : x1 : 0000000000000007 x0 : 000000000000001d
I Call trace:
I         : enable_irq+0x98/0xf0
I         : swrm_runtime_suspend+0x390/0x47c
I         : pm_generic_runtime_suspend+0x28/0x3c
I         : __rpm_callback+0x12c/0x218
I         : rpm_suspend+0x420/0x7cc
I         : pm_runtime_work+0x98/0xa8
I         : process_one_work+0x228/0x3f4
I         : worker_thread+0x264/0x4b0
I         : kthread+0x13c/0x158
I         : ret_from_fork+0x10/0x18
W         : ---[ end trace 56c9cc0df5ea202b ]---

Change-Id: Ic539bfc8d595faf530361d32e0be4ce9009fec08
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: LibXZR <i@xzr.moe>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
hxsyzl pushed a commit that referenced this pull request Feb 11, 2025
There were many order-3 fail allocation report while VM had lots of
*reclaimable* memory.

17353.434071] kworker/u16:4 invoked oom-killer: gfp_mask=0x6160c0(GFP_KERNEL|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_MEMALLOC), nodemask=(null), order=3, oom_score_adj=0
[17353.434079] kworker/u16:4 cpuset=/ mems_allowed=0
[17353.434086] CPU: 6 PID: 30045 Comm: kworker/u16:4 Tainted: G S      WC O      4.19.95-g8137b6ce669e-ab6554412 #1
[17353.434089] Hardware name: Google Inc. MSM sm7250 v2 Bramble DVT (DT)
[17353.434194] Workqueue: iparepwq95 __typeid__ZTSFiP44ipa_disable_force_clear_datapath_req_msg_v01E_global_addr [ipa3]
[17353.434197] Call trace:
[17353.434206] __typeid__ZTSFjP11task_structPK11user_regsetE_global_addr+0x14/0x18
[17353.434210] dump_stack+0xbc/0xf8
[17353.434217] dump_header+0xc8/0x250
[17353.434220] oom_kill_process+0x130/0x538
[17353.434222] out_of_memory+0x320/0x444
[17353.434226] __alloc_pages_nodemask+0x1124/0x13b4
[17353.434314] ipa3_alloc_rx_pkt_page+0x64/0x1a8 [ipa3]
[17353.434403] ipa3_wq_page_repl+0x78/0x1a4 [ipa3]
[17353.434407] process_one_work+0x3a8/0x6e4
[17353.434410] worker_thread+0x394/0x820
[17353.434413] kthread+0x19c/0x1ac
[17353.434417] ret_from_fork+0x10/0x18
[17353.434419] Mem-Info:
[17353.434424] active_anon:357378 inactive_anon:119141 isolated_anon:13\x0a active_file:97495 inactive_file:122151 isolated_file:22\x0a unevictable:49750 dirty:3553 writeback:0 unstable:0\x0a slab_reclaimable:30018 slab_unreclaimable:73884\x0a mapped:259586 shmem:27580 pagetables:39581 bounce:0\x0a free:17710 free_pcp:301 free_cma:0
[17353.434433] Node 0 active_anon:1429512kB inactive_anon:476564kB active_file:389980kB inactive_file:488604kB unevictable:199000kB isolated(anon):52kB isolated(file):88kB mapped:1038344kB dirty:14212kB writeback:0kB shmem:110320kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[17353.434439] Normal free:70840kB min:9172kB low:43900kB high:49484kB active_anon:1429284kB inactive_anon:476336kB active_file:389980kB inactive_file:488604kB unevictable:199000kB writepending:14212kB present:5764280kB managed:5584928kB mlocked:199000kB kernel_stack:92656kB shadow_call_stack:5792kB pagetables:158324kB bounce:0kB free_pcp:1204kB local_pcp:108kB free_cma:0kB
[17353.434441] lowmem_reserve[]: 0 0
[17353.434444] Normal: 8956*4kB (UMEH) 2726*8kB (UH) 751*16kB (UH) 33*32kB (H) 7*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 71152kB
[17353.434451] 300317 total pagecache pages
[17353.434454] 4228 pages in swap cache
[17353.434456] Swap cache stats: add 20710158, delete 20707317, find 1014864/9891370
[17353.434459] Free swap  = 103732kB
[17353.434460] Total swap = 2097148kB
[17353.434462] 1441070 pages RAM
[17353.434465] 0 pages HighMem/MovableOnly
[17353.434466] 44838 pages reserved
[17353.434469] 73728 pages cma reserved

When we saw the trace, compaction finished with COMPACT_COMPLETE(iow, it
already did full scanning a zone but failed to create order-3 allocation)
so should_compact_retry returns "false".

           <...>-30045 [006] .... 17353.433704: reclaim_retry_zone: node=0 zone=Normal   order=3 reclaimable=696132 available=713920 min_wmark=2293 no_progress_loops=0 wmark_check=0
           <...>-30045 [006] .... 17353.433706: compact_retry: order=3 priority=COMPACT_PRIO_SYNC_FULL compaction_result=failed retries=0 max_retries=16 should_retry=0

If we see previous trace, we could see compaction is hard to find free pages
in the zone so free scanner of compaction moves fast toward migration scanner
and finally, they(migration scanner and free page scanner) crossed over.

           <...>-30045 [006] .... 17353.427026: mm_compaction_isolate_freepages: range=(0x144c00 ~ 0x145000) nr_scanned=784 nr_taken=0
           <...>-30045 [006] .... 17353.427037: mm_compaction_isolate_freepages: range=(0x144800 ~ 0x144c00) nr_scanned=1019 nr_taken=0
           <...>-30045 [006] .... 17353.427049: mm_compaction_isolate_freepages: range=(0x144400 ~ 0x144800) nr_scanned=880 nr_taken=1
           <...>-30045 [006] .... 17353.427061: mm_compaction_isolate_freepages: range=(0x144000 ~ 0x144400) nr_scanned=869 nr_taken=0
           <...>-30045 [006] .... 17353.427212: mm_compaction_isolate_freepages: range=(0x140c00 ~ 0x141000) nr_scanned=1016 nr_taken=0
..
..
           <...>-30045 [006] .... 17353.433696: mm_compaction_finished: node=0 zone=Normal   order=3 ret=complete
           <...>-30045 [006] .... 17353.433698: mm_compaction_end: zone_start=0x80600 migrate_pfn=0xc9400 free_pfn=0xc9500 zone_end=0x200000, mode=sync status=complete

If we see previous trace to see reclaim activities, we could see
it was not hard to reclaim memory.

           <...>-30045 [006] .... 17353.413941: mm_vmscan_direct_reclaim_begin: order=3 may_writepage=1 gfp_flags=GFP_KERNEL|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_MEMALLOC classzone_idx=0
           <...>-30045 [006] d..1 17353.413946: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=8 nr_scanned=8 nr_skipped=0 nr_taken=8 lru=inactive_anon
           <...>-30045 [006] .... 17353.413958: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=8 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=8 nr_ref_keep=0 nr_unmap_fail=0 priority=12 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.413960: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119119 inactive=119119 total_active=357352 active=357352 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] d..1 17353.413965: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=22 nr_scanned=22 nr_skipped=0 nr_taken=22 lru=inactive_file
           <...>-30045 [006] .... 17353.413979: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=22 nr_reclaimed=22 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=12 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.413979: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=122195 inactive=122195 total_active=97508 active=97508 ratio=1 flags=RECLAIM_WB_FILE
           <...>-30045 [006] .... 17353.413980: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119119 inactive=119119 total_active=357352 active=357352 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414134: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414135: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] d..1 17353.414141: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=29 nr_scanned=29 nr_skipped=0 nr_taken=29 lru=inactive_anon
           <...>-30045 [006] .... 17353.414170: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=29 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=29 nr_ref_keep=0 nr_unmap_fail=0 priority=10 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.414170: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119107 inactive=119107 total_active=357385 active=357385 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] d..1 17353.414176: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=active_anon
           <...>-30045 [006] .... 17353.414206: mm_vmscan_lru_shrink_active: nid=0 nr_taken=32 nr_active=0 nr_deactivated=32 nr_referenced=32 priority=10 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
           <...>-30045 [006] d..1 17353.414212: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=inactive_file
           <...>-30045 [006] .... 17353.414225: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=10 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.414225: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=122131 inactive=122131 total_active=97508 active=97508 ratio=1 flags=RECLAIM_WB_FILE
           <...>-30045 [006] d..1 17353.414228: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=16 nr_scanned=16 nr_skipped=0 nr_taken=16 lru=inactive_file
           <...>-30045 [006] .... 17353.414235: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=16 nr_reclaimed=16 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=10 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.414235: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=122115 inactive=122115 total_active=97508 active=97508 ratio=1 flags=RECLAIM_WB_FILE
           <...>-30045 [006] .... 17353.414236: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119139 inactive=119139 total_active=357353 active=357353 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414320: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414321: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414339: mm_vmscan_direct_reclaim_end: nr_reclaimed=70

Based on that, we could assume that if reclaimer has reclaimed more pages,
compaction could find free pages easily so free scanner of compaction were
not moved fast like that. That means it wouldn't fail for non-costly high-order
allocation.

What this patch does is if the order is non-costly high order allocation,
it will keep trying migration with reclaiming if system has enough
reclaimable memory.

Bug: 156785617
Bug: 158449887
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ic02146be8acc4334b51be6cea54411432547608d
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
hxsyzl pushed a commit that referenced this pull request Feb 11, 2025
…spend()

Add irq request check condition before enabling swrm interrupt
This will make sure to only enabled interrupt request when it is in disabled state.

This Fixes:
W         : ------------[ cut here ]------------
W         : Unbalanced enable for IRQ 502
W         : WARNING: CPU: 4 PID: 81 at kernel/irq/manage.c:621 enable_irq+0x98/0xf0
I         : Modules linked in:
I         : CPU: 4 PID: 81 Comm: kworker/4:1 Tainted: G S                4.19.197-IMMENSITY-g09a924c384cb #1
I Hardware name: Qualcomm Technologies, Inc. xiaomi alioth (DT)
I Workqueue: pm pm_runtime_work
I pstate  : 60c00085 (nZCv daIf +PAN +UAO)
I pc      : enable_irq+0x98/0xf0
I lr      : enable_irq+0x98/0xf0
I sp      : ffffff800885bbb0
I         : x29: ffffff800885bbc0 x28: ffffffa6fea0db38
I         : x27: 0000000000000002 x26: 0000000000000000
I         : x25: ffffffd1991ef37d x24: 0000000000000000
I         : x23: ffffffd1991edca1 x22: ffffffd199555410
I         : x21: ffffffd1991ef248 x20: 00000000000001f6
I         : x19: ffffffd18cc7b400 x18: ffffffd1b4e9f048
I         : x17: 0000000000000000 x16: 0000000000000000
I         : x15: 0000000000000086 x14: 0000000000000030
I         : x13: 0000000000049754 x12: 0000000000000000
I         : x11: 0000000000000000 x10: 0000000000000007
I         : x9 : 060ca0f25e42ae00 x8 : 060ca0f25e42ae00
I         : x7 : 0000000000000000 x6 : ffffffa6fed3f8e5
I         : x5 : 00000000001b68dc x4 : 000000000000000e
I         : x3 : 0000000000000032 x2 : 0000000000000007
I         : x1 : 0000000000000007 x0 : 000000000000001d
I Call trace:
I         : enable_irq+0x98/0xf0
I         : swrm_runtime_suspend+0x390/0x47c
I         : pm_generic_runtime_suspend+0x28/0x3c
I         : __rpm_callback+0x12c/0x218
I         : rpm_suspend+0x420/0x7cc
I         : pm_runtime_work+0x98/0xa8
I         : process_one_work+0x228/0x3f4
I         : worker_thread+0x264/0x4b0
I         : kthread+0x13c/0x158
I         : ret_from_fork+0x10/0x18
W         : ---[ end trace 56c9cc0df5ea202b ]---

Change-Id: Ic539bfc8d595faf530361d32e0be4ce9009fec08
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: LibXZR <i@xzr.moe>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
hxsyzl pushed a commit that referenced this pull request Feb 12, 2025
There were many order-3 fail allocation report while VM had lots of
*reclaimable* memory.

17353.434071] kworker/u16:4 invoked oom-killer: gfp_mask=0x6160c0(GFP_KERNEL|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_MEMALLOC), nodemask=(null), order=3, oom_score_adj=0
[17353.434079] kworker/u16:4 cpuset=/ mems_allowed=0
[17353.434086] CPU: 6 PID: 30045 Comm: kworker/u16:4 Tainted: G S      WC O      4.19.95-g8137b6ce669e-ab6554412 #1
[17353.434089] Hardware name: Google Inc. MSM sm7250 v2 Bramble DVT (DT)
[17353.434194] Workqueue: iparepwq95 __typeid__ZTSFiP44ipa_disable_force_clear_datapath_req_msg_v01E_global_addr [ipa3]
[17353.434197] Call trace:
[17353.434206] __typeid__ZTSFjP11task_structPK11user_regsetE_global_addr+0x14/0x18
[17353.434210] dump_stack+0xbc/0xf8
[17353.434217] dump_header+0xc8/0x250
[17353.434220] oom_kill_process+0x130/0x538
[17353.434222] out_of_memory+0x320/0x444
[17353.434226] __alloc_pages_nodemask+0x1124/0x13b4
[17353.434314] ipa3_alloc_rx_pkt_page+0x64/0x1a8 [ipa3]
[17353.434403] ipa3_wq_page_repl+0x78/0x1a4 [ipa3]
[17353.434407] process_one_work+0x3a8/0x6e4
[17353.434410] worker_thread+0x394/0x820
[17353.434413] kthread+0x19c/0x1ac
[17353.434417] ret_from_fork+0x10/0x18
[17353.434419] Mem-Info:
[17353.434424] active_anon:357378 inactive_anon:119141 isolated_anon:13\x0a active_file:97495 inactive_file:122151 isolated_file:22\x0a unevictable:49750 dirty:3553 writeback:0 unstable:0\x0a slab_reclaimable:30018 slab_unreclaimable:73884\x0a mapped:259586 shmem:27580 pagetables:39581 bounce:0\x0a free:17710 free_pcp:301 free_cma:0
[17353.434433] Node 0 active_anon:1429512kB inactive_anon:476564kB active_file:389980kB inactive_file:488604kB unevictable:199000kB isolated(anon):52kB isolated(file):88kB mapped:1038344kB dirty:14212kB writeback:0kB shmem:110320kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[17353.434439] Normal free:70840kB min:9172kB low:43900kB high:49484kB active_anon:1429284kB inactive_anon:476336kB active_file:389980kB inactive_file:488604kB unevictable:199000kB writepending:14212kB present:5764280kB managed:5584928kB mlocked:199000kB kernel_stack:92656kB shadow_call_stack:5792kB pagetables:158324kB bounce:0kB free_pcp:1204kB local_pcp:108kB free_cma:0kB
[17353.434441] lowmem_reserve[]: 0 0
[17353.434444] Normal: 8956*4kB (UMEH) 2726*8kB (UH) 751*16kB (UH) 33*32kB (H) 7*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 71152kB
[17353.434451] 300317 total pagecache pages
[17353.434454] 4228 pages in swap cache
[17353.434456] Swap cache stats: add 20710158, delete 20707317, find 1014864/9891370
[17353.434459] Free swap  = 103732kB
[17353.434460] Total swap = 2097148kB
[17353.434462] 1441070 pages RAM
[17353.434465] 0 pages HighMem/MovableOnly
[17353.434466] 44838 pages reserved
[17353.434469] 73728 pages cma reserved

When we saw the trace, compaction finished with COMPACT_COMPLETE(iow, it
already did full scanning a zone but failed to create order-3 allocation)
so should_compact_retry returns "false".

           <...>-30045 [006] .... 17353.433704: reclaim_retry_zone: node=0 zone=Normal   order=3 reclaimable=696132 available=713920 min_wmark=2293 no_progress_loops=0 wmark_check=0
           <...>-30045 [006] .... 17353.433706: compact_retry: order=3 priority=COMPACT_PRIO_SYNC_FULL compaction_result=failed retries=0 max_retries=16 should_retry=0

If we see previous trace, we could see compaction is hard to find free pages
in the zone so free scanner of compaction moves fast toward migration scanner
and finally, they(migration scanner and free page scanner) crossed over.

           <...>-30045 [006] .... 17353.427026: mm_compaction_isolate_freepages: range=(0x144c00 ~ 0x145000) nr_scanned=784 nr_taken=0
           <...>-30045 [006] .... 17353.427037: mm_compaction_isolate_freepages: range=(0x144800 ~ 0x144c00) nr_scanned=1019 nr_taken=0
           <...>-30045 [006] .... 17353.427049: mm_compaction_isolate_freepages: range=(0x144400 ~ 0x144800) nr_scanned=880 nr_taken=1
           <...>-30045 [006] .... 17353.427061: mm_compaction_isolate_freepages: range=(0x144000 ~ 0x144400) nr_scanned=869 nr_taken=0
           <...>-30045 [006] .... 17353.427212: mm_compaction_isolate_freepages: range=(0x140c00 ~ 0x141000) nr_scanned=1016 nr_taken=0
..
..
           <...>-30045 [006] .... 17353.433696: mm_compaction_finished: node=0 zone=Normal   order=3 ret=complete
           <...>-30045 [006] .... 17353.433698: mm_compaction_end: zone_start=0x80600 migrate_pfn=0xc9400 free_pfn=0xc9500 zone_end=0x200000, mode=sync status=complete

If we see previous trace to see reclaim activities, we could see
it was not hard to reclaim memory.

           <...>-30045 [006] .... 17353.413941: mm_vmscan_direct_reclaim_begin: order=3 may_writepage=1 gfp_flags=GFP_KERNEL|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_MEMALLOC classzone_idx=0
           <...>-30045 [006] d..1 17353.413946: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=8 nr_scanned=8 nr_skipped=0 nr_taken=8 lru=inactive_anon
           <...>-30045 [006] .... 17353.413958: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=8 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=8 nr_ref_keep=0 nr_unmap_fail=0 priority=12 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.413960: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119119 inactive=119119 total_active=357352 active=357352 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] d..1 17353.413965: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=22 nr_scanned=22 nr_skipped=0 nr_taken=22 lru=inactive_file
           <...>-30045 [006] .... 17353.413979: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=22 nr_reclaimed=22 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=12 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.413979: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=122195 inactive=122195 total_active=97508 active=97508 ratio=1 flags=RECLAIM_WB_FILE
           <...>-30045 [006] .... 17353.413980: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119119 inactive=119119 total_active=357352 active=357352 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414134: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414135: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] d..1 17353.414141: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=29 nr_scanned=29 nr_skipped=0 nr_taken=29 lru=inactive_anon
           <...>-30045 [006] .... 17353.414170: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=29 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=29 nr_ref_keep=0 nr_unmap_fail=0 priority=10 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.414170: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119107 inactive=119107 total_active=357385 active=357385 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] d..1 17353.414176: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=active_anon
           <...>-30045 [006] .... 17353.414206: mm_vmscan_lru_shrink_active: nid=0 nr_taken=32 nr_active=0 nr_deactivated=32 nr_referenced=32 priority=10 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
           <...>-30045 [006] d..1 17353.414212: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=inactive_file
           <...>-30045 [006] .... 17353.414225: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=10 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.414225: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=122131 inactive=122131 total_active=97508 active=97508 ratio=1 flags=RECLAIM_WB_FILE
           <...>-30045 [006] d..1 17353.414228: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=16 nr_scanned=16 nr_skipped=0 nr_taken=16 lru=inactive_file
           <...>-30045 [006] .... 17353.414235: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=16 nr_reclaimed=16 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=10 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.414235: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=122115 inactive=122115 total_active=97508 active=97508 ratio=1 flags=RECLAIM_WB_FILE
           <...>-30045 [006] .... 17353.414236: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119139 inactive=119139 total_active=357353 active=357353 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414320: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414321: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414339: mm_vmscan_direct_reclaim_end: nr_reclaimed=70

Based on that, we could assume that if reclaimer has reclaimed more pages,
compaction could find free pages easily so free scanner of compaction were
not moved fast like that. That means it wouldn't fail for non-costly high-order
allocation.

What this patch does is if the order is non-costly high order allocation,
it will keep trying migration with reclaiming if system has enough
reclaimable memory.

Bug: 156785617
Bug: 158449887
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ic02146be8acc4334b51be6cea54411432547608d
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
hxsyzl pushed a commit that referenced this pull request Feb 12, 2025
…spend()

Add irq request check condition before enabling swrm interrupt
This will make sure to only enabled interrupt request when it is in disabled state.

This Fixes:
W         : ------------[ cut here ]------------
W         : Unbalanced enable for IRQ 502
W         : WARNING: CPU: 4 PID: 81 at kernel/irq/manage.c:621 enable_irq+0x98/0xf0
I         : Modules linked in:
I         : CPU: 4 PID: 81 Comm: kworker/4:1 Tainted: G S                4.19.197-IMMENSITY-g09a924c384cb #1
I Hardware name: Qualcomm Technologies, Inc. xiaomi alioth (DT)
I Workqueue: pm pm_runtime_work
I pstate  : 60c00085 (nZCv daIf +PAN +UAO)
I pc      : enable_irq+0x98/0xf0
I lr      : enable_irq+0x98/0xf0
I sp      : ffffff800885bbb0
I         : x29: ffffff800885bbc0 x28: ffffffa6fea0db38
I         : x27: 0000000000000002 x26: 0000000000000000
I         : x25: ffffffd1991ef37d x24: 0000000000000000
I         : x23: ffffffd1991edca1 x22: ffffffd199555410
I         : x21: ffffffd1991ef248 x20: 00000000000001f6
I         : x19: ffffffd18cc7b400 x18: ffffffd1b4e9f048
I         : x17: 0000000000000000 x16: 0000000000000000
I         : x15: 0000000000000086 x14: 0000000000000030
I         : x13: 0000000000049754 x12: 0000000000000000
I         : x11: 0000000000000000 x10: 0000000000000007
I         : x9 : 060ca0f25e42ae00 x8 : 060ca0f25e42ae00
I         : x7 : 0000000000000000 x6 : ffffffa6fed3f8e5
I         : x5 : 00000000001b68dc x4 : 000000000000000e
I         : x3 : 0000000000000032 x2 : 0000000000000007
I         : x1 : 0000000000000007 x0 : 000000000000001d
I Call trace:
I         : enable_irq+0x98/0xf0
I         : swrm_runtime_suspend+0x390/0x47c
I         : pm_generic_runtime_suspend+0x28/0x3c
I         : __rpm_callback+0x12c/0x218
I         : rpm_suspend+0x420/0x7cc
I         : pm_runtime_work+0x98/0xa8
I         : process_one_work+0x228/0x3f4
I         : worker_thread+0x264/0x4b0
I         : kthread+0x13c/0x158
I         : ret_from_fork+0x10/0x18
W         : ---[ end trace 56c9cc0df5ea202b ]---

Change-Id: Ic539bfc8d595faf530361d32e0be4ce9009fec08
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: LibXZR <i@xzr.moe>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
hxsyzl pushed a commit that referenced this pull request Feb 12, 2025
There were many order-3 fail allocation report while VM had lots of
*reclaimable* memory.

17353.434071] kworker/u16:4 invoked oom-killer: gfp_mask=0x6160c0(GFP_KERNEL|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_MEMALLOC), nodemask=(null), order=3, oom_score_adj=0
[17353.434079] kworker/u16:4 cpuset=/ mems_allowed=0
[17353.434086] CPU: 6 PID: 30045 Comm: kworker/u16:4 Tainted: G S      WC O      4.19.95-g8137b6ce669e-ab6554412 #1
[17353.434089] Hardware name: Google Inc. MSM sm7250 v2 Bramble DVT (DT)
[17353.434194] Workqueue: iparepwq95 __typeid__ZTSFiP44ipa_disable_force_clear_datapath_req_msg_v01E_global_addr [ipa3]
[17353.434197] Call trace:
[17353.434206] __typeid__ZTSFjP11task_structPK11user_regsetE_global_addr+0x14/0x18
[17353.434210] dump_stack+0xbc/0xf8
[17353.434217] dump_header+0xc8/0x250
[17353.434220] oom_kill_process+0x130/0x538
[17353.434222] out_of_memory+0x320/0x444
[17353.434226] __alloc_pages_nodemask+0x1124/0x13b4
[17353.434314] ipa3_alloc_rx_pkt_page+0x64/0x1a8 [ipa3]
[17353.434403] ipa3_wq_page_repl+0x78/0x1a4 [ipa3]
[17353.434407] process_one_work+0x3a8/0x6e4
[17353.434410] worker_thread+0x394/0x820
[17353.434413] kthread+0x19c/0x1ac
[17353.434417] ret_from_fork+0x10/0x18
[17353.434419] Mem-Info:
[17353.434424] active_anon:357378 inactive_anon:119141 isolated_anon:13\x0a active_file:97495 inactive_file:122151 isolated_file:22\x0a unevictable:49750 dirty:3553 writeback:0 unstable:0\x0a slab_reclaimable:30018 slab_unreclaimable:73884\x0a mapped:259586 shmem:27580 pagetables:39581 bounce:0\x0a free:17710 free_pcp:301 free_cma:0
[17353.434433] Node 0 active_anon:1429512kB inactive_anon:476564kB active_file:389980kB inactive_file:488604kB unevictable:199000kB isolated(anon):52kB isolated(file):88kB mapped:1038344kB dirty:14212kB writeback:0kB shmem:110320kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[17353.434439] Normal free:70840kB min:9172kB low:43900kB high:49484kB active_anon:1429284kB inactive_anon:476336kB active_file:389980kB inactive_file:488604kB unevictable:199000kB writepending:14212kB present:5764280kB managed:5584928kB mlocked:199000kB kernel_stack:92656kB shadow_call_stack:5792kB pagetables:158324kB bounce:0kB free_pcp:1204kB local_pcp:108kB free_cma:0kB
[17353.434441] lowmem_reserve[]: 0 0
[17353.434444] Normal: 8956*4kB (UMEH) 2726*8kB (UH) 751*16kB (UH) 33*32kB (H) 7*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 71152kB
[17353.434451] 300317 total pagecache pages
[17353.434454] 4228 pages in swap cache
[17353.434456] Swap cache stats: add 20710158, delete 20707317, find 1014864/9891370
[17353.434459] Free swap  = 103732kB
[17353.434460] Total swap = 2097148kB
[17353.434462] 1441070 pages RAM
[17353.434465] 0 pages HighMem/MovableOnly
[17353.434466] 44838 pages reserved
[17353.434469] 73728 pages cma reserved

When we saw the trace, compaction finished with COMPACT_COMPLETE(iow, it
already did full scanning a zone but failed to create order-3 allocation)
so should_compact_retry returns "false".

           <...>-30045 [006] .... 17353.433704: reclaim_retry_zone: node=0 zone=Normal   order=3 reclaimable=696132 available=713920 min_wmark=2293 no_progress_loops=0 wmark_check=0
           <...>-30045 [006] .... 17353.433706: compact_retry: order=3 priority=COMPACT_PRIO_SYNC_FULL compaction_result=failed retries=0 max_retries=16 should_retry=0

If we see previous trace, we could see compaction is hard to find free pages
in the zone so free scanner of compaction moves fast toward migration scanner
and finally, they(migration scanner and free page scanner) crossed over.

           <...>-30045 [006] .... 17353.427026: mm_compaction_isolate_freepages: range=(0x144c00 ~ 0x145000) nr_scanned=784 nr_taken=0
           <...>-30045 [006] .... 17353.427037: mm_compaction_isolate_freepages: range=(0x144800 ~ 0x144c00) nr_scanned=1019 nr_taken=0
           <...>-30045 [006] .... 17353.427049: mm_compaction_isolate_freepages: range=(0x144400 ~ 0x144800) nr_scanned=880 nr_taken=1
           <...>-30045 [006] .... 17353.427061: mm_compaction_isolate_freepages: range=(0x144000 ~ 0x144400) nr_scanned=869 nr_taken=0
           <...>-30045 [006] .... 17353.427212: mm_compaction_isolate_freepages: range=(0x140c00 ~ 0x141000) nr_scanned=1016 nr_taken=0
..
..
           <...>-30045 [006] .... 17353.433696: mm_compaction_finished: node=0 zone=Normal   order=3 ret=complete
           <...>-30045 [006] .... 17353.433698: mm_compaction_end: zone_start=0x80600 migrate_pfn=0xc9400 free_pfn=0xc9500 zone_end=0x200000, mode=sync status=complete

If we see previous trace to see reclaim activities, we could see
it was not hard to reclaim memory.

           <...>-30045 [006] .... 17353.413941: mm_vmscan_direct_reclaim_begin: order=3 may_writepage=1 gfp_flags=GFP_KERNEL|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_MEMALLOC classzone_idx=0
           <...>-30045 [006] d..1 17353.413946: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=8 nr_scanned=8 nr_skipped=0 nr_taken=8 lru=inactive_anon
           <...>-30045 [006] .... 17353.413958: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=8 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=8 nr_ref_keep=0 nr_unmap_fail=0 priority=12 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.413960: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119119 inactive=119119 total_active=357352 active=357352 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] d..1 17353.413965: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=22 nr_scanned=22 nr_skipped=0 nr_taken=22 lru=inactive_file
           <...>-30045 [006] .... 17353.413979: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=22 nr_reclaimed=22 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=12 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.413979: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=122195 inactive=122195 total_active=97508 active=97508 ratio=1 flags=RECLAIM_WB_FILE
           <...>-30045 [006] .... 17353.413980: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119119 inactive=119119 total_active=357352 active=357352 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414134: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414135: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] d..1 17353.414141: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=29 nr_scanned=29 nr_skipped=0 nr_taken=29 lru=inactive_anon
           <...>-30045 [006] .... 17353.414170: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=29 nr_reclaimed=0 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=29 nr_ref_keep=0 nr_unmap_fail=0 priority=10 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.414170: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119107 inactive=119107 total_active=357385 active=357385 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] d..1 17353.414176: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=active_anon
           <...>-30045 [006] .... 17353.414206: mm_vmscan_lru_shrink_active: nid=0 nr_taken=32 nr_active=0 nr_deactivated=32 nr_referenced=32 priority=10 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
           <...>-30045 [006] d..1 17353.414212: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=inactive_file
           <...>-30045 [006] .... 17353.414225: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=10 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.414225: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=122131 inactive=122131 total_active=97508 active=97508 ratio=1 flags=RECLAIM_WB_FILE
           <...>-30045 [006] d..1 17353.414228: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=3 nr_requested=16 nr_scanned=16 nr_skipped=0 nr_taken=16 lru=inactive_file
           <...>-30045 [006] .... 17353.414235: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=16 nr_reclaimed=16 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 priority=10 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
           <...>-30045 [006] .... 17353.414235: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=122115 inactive=122115 total_active=97508 active=97508 ratio=1 flags=RECLAIM_WB_FILE
           <...>-30045 [006] .... 17353.414236: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=119139 inactive=119139 total_active=357353 active=357353 ratio=3 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414320: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414321: mm_vmscan_inactive_list_is_low: nid=0 reclaim_idx=0 total_inactive=0 inactive=0 total_active=0 active=0 ratio=1 flags=RECLAIM_WB_ANON
           <...>-30045 [006] .... 17353.414339: mm_vmscan_direct_reclaim_end: nr_reclaimed=70

Based on that, we could assume that if reclaimer has reclaimed more pages,
compaction could find free pages easily so free scanner of compaction were
not moved fast like that. That means it wouldn't fail for non-costly high-order
allocation.

What this patch does is if the order is non-costly high order allocation,
it will keep trying migration with reclaiming if system has enough
reclaimable memory.

Bug: 156785617
Bug: 158449887
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ic02146be8acc4334b51be6cea54411432547608d
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
hxsyzl pushed a commit that referenced this pull request Feb 14, 2025
hxsyzl pushed a commit that referenced this pull request Feb 22, 2025