kernel_samsung_a53x

Author	SHA1	Message	Date
Adrian Hunter	785481cfc0	perf: Prevent passing zero nr_pages to rb_alloc_aux() [ Upstream commit dbc48c8f41c208082cfa95e973560134489e3309 ] nr_pages is unsigned long but gets passed to rb_alloc_aux() as an int, and is stored as an int. Only power-of-2 values are accepted, so if nr_pages is a 64-bit value, it will be passed to rb_alloc_aux() as zero. That is not ideal because: 1. the value is incorrect 2. rb_alloc_aux() is at risk of misbehaving, although it manages to return -ENOMEM in that case, it is a result of passing zero to get_order() even though the get_order() result is documented to be undefined in that case. Fix by simply validating the maximum supported value in the first place. Use -ENOMEM error code for consistency with the current error code that is returned in that case. Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240624201101.60186-6-adrian.hunter@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:20:08 +01:00
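The truncation described in the commit above can be illustrated with a small userspace sketch. It is not the upstream patch: `validate_aux_nr_pages()` and `low_32_bits()` are hypothetical helper names, and returning -EINVAL for non-power-of-2 values is an assumption made only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* What an int-sized parameter receives: only the low 32 bits survive. */
static uint32_t low_32_bits(uint64_t nr_pages)
{
    return (uint32_t)nr_pages;
}

/*
 * Hypothetical stand-in for the validation the commit describes:
 * reject values that cannot be stored in an int before they are
 * passed on. The -EINVAL branch is an assumption of this sketch.
 */
static int validate_aux_nr_pages(uint64_t nr_pages)
{
    /* Only power-of-2 page counts are accepted. */
    if (nr_pages == 0 || (nr_pages & (nr_pages - 1)))
        return -EINVAL;
    /* Anything above INT_MAX would be truncated when stored as int. */
    if (nr_pages > INT_MAX)
        return -ENOMEM;
    return 0;
}
```

A power-of-2 value at or above 2^32 has all of its low 32 bits clear, which is why it arrives as zero without the up-front check.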
Adrian Hunter	c06300c46d	perf: Fix perf_aux_size() for greater-than 32-bit size [ Upstream commit 3df94a5b1078dfe2b0c03f027d018800faf44c82 ] perf_buffer->aux_nr_pages uses a 32-bit type, so a cast is needed to calculate a 64-bit size. Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240624201101.60186-5-adrian.hunter@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:20:08 +01:00
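A minimal sketch of why the cast in the commit above matters. The page shift of 12 and both function names are assumptions for illustration, and an unsigned 32-bit type is used here so that the overflow is well defined in plain C.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_ASSUMED 12 /* 4 KiB pages, assumed for this sketch */

/* Shift performed in 32 bits: high bits are lost before widening. */
static uint64_t aux_size_32bit_math(uint32_t nr_pages)
{
    return (uint32_t)(nr_pages << PAGE_SHIFT_ASSUMED);
}

/* Widen first, then shift: the full 64-bit size is preserved. */
static uint64_t aux_size_64bit_math(uint32_t nr_pages)
{
    return (uint64_t)nr_pages << PAGE_SHIFT_ASSUMED;
}
```

With 2^20 pages the 32-bit computation wraps to zero, while the widened computation yields 4 GiB.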
Ksawlii	9f899d45ad	Revert "workqueue: Make queue_rcu_work() use call_rcu_flush()" This reverts commit `d0dc26b405`.	2024-11-19 18:15:40 +01:00
Ksawlii	6f09981af2	Revert "kernel: sysctl: add init protection to common mm-related nodes" This reverts commit `7059d8baa3`.	2024-11-19 18:13:49 +01:00
Zhen Lei	9fdbd3eed2	kallsyms: Improve the performance of kallsyms_lookup_name() Currently, to search for a symbol, we need to expand the symbols in 'kallsyms_names' one by one, and then use the expanded string for comparison. It's O(n). If we sort names in ascending order like addresses, we can also use binary search. It's O(log(n)). In order not to change the implementation of "/proc/kallsyms", the table kallsyms_names[] is still stored in a one-to-one correspondence with the address in ascending order. Add array kallsyms_seqs_of_names[], it's indexed by the sequence number of the sorted names, and the corresponding content is the sequence number of the sorted addresses. For example: Assume that the index of NameX in array kallsyms_seqs_of_names[] is 'i', the content of kallsyms_seqs_of_names[i] is 'k', then the corresponding address of NameX is kallsyms_addresses[k]. The offset in kallsyms_names[] is get_symbol_offset(k). Note that the memory usage will increase by (4 * kallsyms_num_syms) bytes, the next two patches will reduce (1 * kallsyms_num_syms) bytes and properly handle the case CONFIG_LTO_CLANG=y. Performance test results: (x86) Before: min=234, max=10364402, avg=5206926 min=267, max=11168517, avg=5207587 After: min=1016, max=90894, avg=7272 min=1014, max=93470, avg=7293 The average lookup performance of kallsyms_lookup_name() improved 715x. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2024-11-19 18:06:35 +01:00
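The indirection described in the commit above can be sketched as follows. This is a toy model, not the kernel code: the real table stores compressed names and uses get_symbol_offset(), whereas here `names_by_addr[]`, `addresses[]`, the four sample symbols, and `lookup_name()` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Names and addresses stay in ascending address order. */
static const char *const names_by_addr[] = { "start", "alpha", "mid", "beta" };
static const unsigned long addresses[] = { 0x1000, 0x2000, 0x3000, 0x4000 };

/*
 * Indexed by position in name-sorted order; the value is the index
 * into the address-ordered tables (alpha, beta, mid, start).
 */
static const unsigned int seqs_of_names[] = { 1, 3, 2, 0 };

#define NUM_SYMS (sizeof(addresses) / sizeof(addresses[0]))

/* Binary search over the name-sorted view; returns 0 if not found. */
static unsigned long lookup_name(const char *name)
{
    size_t lo = 0, hi = NUM_SYMS;

    while (lo < hi) {
        size_t i = lo + (hi - lo) / 2;
        unsigned int k = seqs_of_names[i];
        int cmp = strcmp(name, names_by_addr[k]);

        if (cmp == 0)
            return addresses[k];
        if (cmp < 0)
            hi = i;
        else
            lo = i + 1;
    }
    return 0;
}
```

The address-ordered tables are untouched, so anything that walks them in order keeps working; only the extra index array is added.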
Panchajanya1999	82413308e6	power/wakelock: Add a timeout to wakelocks globally A few wakelocks tend to get stuck for no reason. Blocking them isn't necessary and sometimes blocking them breaks basic functionality. Wakelocks like "tx_swr_ctrl" tend to get stuck if earphones stay connected, which drains the battery massively. Test: Keep earphones plugged in and leave the device for a few hours Expected result: No "tx_swr_ctrl" wakelock stays stuck. Actual result: Patch is working as expected. Change-Id: I5296990a84ab44cf6e449d6535b8b99408c415c8 Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live> Signed-off-by: Panchajanya1999 <kernel@panchajanya.dev> (cherry picked from commit c721867bf4dc2e2c316b2623ad97a28382af2c8c) (cherry picked from commit a5e999ea4df99f91b7b5aa5bab5b39123587424f)	2024-11-19 18:06:07 +01:00
Sultan Alsawaf	900245cda2	schedutil: Allow CPU frequency changes to be amended before they're set If the last CPU frequency selected isn't set before a new CPU frequency selection arrives, then use the new selection immediately to avoid using a stale frequency choice. This improves both performance and energy by more closely tracking the scheduler's latest decisions. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>	2024-11-19 18:06:02 +01:00
Tyler Nijmeh	826d5e8824	irq: spurious: Disable IRQ debugging by default Signed-off-by: Tyler Nijmeh <tylernij@gmail.com> Signed-off-by: sohamxda7 <sensoham135@gmail.com> Signed-off-by: Oktapra Amtono <oktapra.amtono@gmail.com> Signed-off-by: Anush02198 <Anush.4376@gmail.com> Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com> Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com> Signed-off-by: NotZeetaa <rodrigo2005contente@gmail.com> Signed-off-by: priiii1808 <priyanshusinghal0818@gmail.com>	2024-11-19 18:05:57 +01:00
Sultan Alsawaf	ae0839f165	kernel: Don't allow IRQ affinity masks to have more than one CPU Even with an affinity mask that has multiple CPUs set, IRQs always run on the first CPU in their affinity mask. Drivers that register an IRQ affinity notifier (such as pm_qos) will therefore have an incorrect assumption of where an IRQ is affined. Fix the IRQ affinity mask deception by forcing it to only contain one set CPU. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>	2024-11-19 18:05:54 +01:00
Sultan Alsawaf	5d83710a9b	kernel: Only set one CPU in the default IRQ affinity mask On ARM, IRQs are executed on the first CPU inside the affinity mask, so setting an affinity mask with more than one CPU set is deceptive and causes issues with pm_qos. To fix this, only set the CPU0 bit inside the affinity mask, since that's where IRQs will run by default. This is a follow-up to "kernel: Don't allow IRQ affinity masks to have more than one CPU". Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>	2024-11-19 18:05:50 +01:00
Sultan Alsawaf	bffb1b52f3	kernel: Warn when an IRQ's affinity notifier gets overwritten An IRQ affinity notifier getting overwritten can point to some annoying issues which need to be resolved, like multiple pm_qos objects being registered to the same IRQ. Print out a warning when this happens to aid debugging. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>	2024-11-19 18:05:46 +01:00
Sultan Alsawaf	2e3484e48b	PM / freezer: Reduce freeze timeout to 1 second for Android Freezing processes on Android usually takes less than 100 ms, and if it takes longer than that to the point where the 20 second freeze timeout is reached, it's because the remaining processes to be frozen are deadlocked waiting for something from a process which is already frozen. There's no point in burning power trying to freeze for that long, so reduce the freeze timeout to a very generous 1 second for Android and don't let anything mess with it. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>	2024-11-19 18:05:37 +01:00
xNombre	5d3ff5040f	alarmtimer: Minimize wakeup time Alarmtimer sets its wakeup timeout to 2s no matter the actual time to nearest timer expiration. This can keep the device awake longer than needed. To fix this, set the wakeup timeout to min + 1 ms for safety margin. Tests revealed that average timer expiration is 1150ms in the future, which suggests there is room available to minimize wakeup times. Before this change the device would enter sleep not earlier than 2s after an alarmtimer suspend error (-EBUSY). With this change the average time to suspend after an alarmtimer suspend error went down to 1.5s with a minimum of 0.248ms (after filtering results higher than 2.6s). This should lead to noticeable power savings as Android uses alarmtimer quite frequently. Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com> Signed-off-by: Zlatan Radovanovic <zlatan.radovanovic@fet.ba>	2024-11-19 18:05:33 +01:00
friedrich420	5afb8f94f1	Kernel/sched: Reduce Latency [Pafcholini] Signed-off-by: HolyAngel <slverwolf@gmail.com> Signed-off-by: Salllz <sal235222727@gmail.com> Signed-off-by: alanndz <alanndz7@gmail.com> Signed-off-by: Cyber Knight <cyberknight755@gmail.com> Signed-off-by: Little-W <1405481963@qq.com>	2024-11-19 18:05:31 +01:00
Yaroslav Furman	ec544c143c	PM / sleep: Skip OOM killer toggles when kernel is compiled for Android Android devices use LMK algorithms, so there's no reason to disable and enable the OOM killer when entering and exiting suspend. This is a fixed version of https://github.com/YaroST12/VIOLENT_kernel/commit/86e59a93b2ef Co-authored-by: Danny Lin <danny@kdrag0n.dev> Signed-off-by: Yaroslav Furman <yaro330@gmail.com> Signed-off-by: celtare21 <celtare21@gmail.com> Signed-off-by: Ren <89468157+Shirayuki39@users.noreply.github.com>	2024-11-19 18:05:27 +01:00
Sultan Alsawaf	419052d8e5	sched/fair: Compile out NUMA code entirely when NUMA is disabled Scheduler code is very hot and every little optimization counts. Instead of constantly checking sched_numa_balancing when NUMA is disabled, compile it out. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>	2024-11-19 18:05:24 +01:00
Clement Courbet	d4b05cdad5	sched: Optimize __calc_delta() A significant portion of __calc_delta() time is spent in the loop shifting a u64 by 32 bits. Use `fls` instead of iterating. This is ~7x faster on benchmarks. The generic `fls` implementation (`generic_fls`) is still ~4x faster than the loop. Architectures that have a better implementation will make use of it. For example, on x86 we get an additional factor 2 in speed without dedicated implementation. On GCC, the asm versions of `fls` are about the same speed as the builtin. On Clang, the versions that use fls are more than twice as slow as the builtin. This is because the way the `fls` function is written, clang puts the value in memory: https://godbolt.org/z/EfMbYe. This bug is filed at https://bugs.llvm.org/show_bug.cgi?id=49406. ``` name cpu/op BM_Calc<__calc_delta_loop> 9.57ms ±12% BM_Calc<__calc_delta_generic_fls> 2.36ms ±13% BM_Calc<__calc_delta_asm_fls> 2.45ms ±13% BM_Calc<__calc_delta_asm_fls_nomem> 1.66ms ±12% BM_Calc<__calc_delta_asm_fls64> 2.46ms ±13% BM_Calc<__calc_delta_asm_fls64_nomem> 1.34ms ±15% BM_Calc<__calc_delta_builtin> 1.32ms ±11% ``` Signed-off-by: Clement Courbet <courbet@google.com> Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210303224653.2579656-1-joshdon@google.com	2024-11-19 18:05:19 +01:00
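The replacement of the shift loop by `fls` in the commit above can be sketched like this. The function names are hypothetical, and `fls32()` is a deliberately portable stand-in written as a loop; a real build would use the architecture's `fls` or a compiler builtin, which is where the speedup comes from.

```c
#include <assert.h>
#include <stdint.h>

/* Portable find-last-set: 1-based index of the top set bit, 0 for x == 0. */
static int fls32(uint32_t x)
{
    int n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Old pattern: shift one bit at a time until the value fits in 32 bits. */
static uint64_t normalize_loop(uint64_t fact, int *shift)
{
    while (fact >> 32) {
        fact >>= 1;
        (*shift)--;
    }
    return fact;
}

/* New pattern: compute the number of shifts needed in one step. */
static uint64_t normalize_fls(uint64_t fact, int *shift)
{
    uint32_t hi = (uint32_t)(fact >> 32);

    if (hi) {
        int fs = fls32(hi);

        *shift -= fs;
        fact >>= fs;
    }
    return fact;
}
```

Both versions drop exactly as many low bits as the high word has significant bits, so they produce the same value and the same adjusted shift.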
Qais Yousef	971267e87b	schedutil : cap iowait boost by uclamp_max Which is a backport of upstream fix: d37aee9018e6 ("sched/uclamp: Fix iowait boost escaping uclamp restriction") Bug: 261695814 Signed-off-by: Qais Yousef <qyousef@google.com> Change-Id: Ibe8175edb9dea35e325f1a6f4306885ab8b6b28a	2024-11-19 18:05:14 +01:00
Rohail33	ca3d31ea66	kernel: time: reduce ntp wakeups	2024-11-19 18:05:11 +01:00
Tyler Nijmeh	f40f9398a3	PM/Sleep: Start killing wakelocks after two minutes of idle (120s) Signed-off-by: Tyler Nijmeh <tylernij@gmail.com> Signed-off-by: ThunderStorms21th nalas <pinakastorm@gmail.com>	2024-11-19 18:05:05 +01:00
Sultan Alsawaf	25da1fb9b2	qos: Don't allow userspace to impose restrictions on CPU idle levels Giving userspace intimate control over CPU latency requirements is nonsense. Userspace can't even stop itself from being preempted, so there's no reason for it to have access to a mechanism primarily used to eliminate CPU delays on the order of microseconds. Remove userspace's ability to send pm_qos requests so that it can't hurt power consumption. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Panchajanya1999 <kernel@panchajanya.dev>	2024-11-19 18:05:02 +01:00
Sultan Alsawaf	74cbd01416	sched/core: Use SCHED_RR in place of SCHED_FIFO for all users Although SCHED_FIFO is a real-time scheduling policy, it can have bad results on system latency, since each SCHED_FIFO task will run to completion before yielding to another task. This can result in visible micro-stalls when a SCHED_FIFO task hogs the CPU for too long. On a system where latency is favored over throughput, using SCHED_RR is a better choice than SCHED_FIFO. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Oktapra Amtono <oktapra.amtono@gmail.com> Signed-off-by: CloudedQuartz <ravenklawasd@gmail.com>	2024-11-19 18:04:58 +01:00
Sultan Alsawaf	cda8f45b3b	cpu: Silence log spam when a CPU is brought up Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: celtare21 <celtare21@gmail.com> Signed-off-by: engstk <eng.stk@sapo.pt>	2024-11-19 18:04:55 +01:00
Yaroslav Furman	e7cede92a8	sched: core: silence no longer affine to cpu logspam Signed-off-by: engstk <eng.stk@sapo.pt>	2024-11-19 18:04:49 +01:00
Sultan Alsawaf	4861626fb1	schedutil: Don't affine sugov kthreads if DVFS is allowed from any CPU Restricting sugov kthreads to their respective CPUFreq policy's CPUs slows down schedutil's ability to switch frequencies. When DVFS is allowed from any CPU, allow respective sugov kthreads to run on any CPU for better performance. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>	2024-11-19 18:04:45 +01:00
atndko	3a5f3cae8a	printk: Silence useless system log spam When charging, healthd and dashd will spam every several secs, it's sooooo noisy and useless. If you launch a userspace app, there will give a logd message, silence it. Signed-off-by: Wahid Khan <wahidzk0091@gmail.com> Signed-off-by: atndko <z1281552865@gmail.com> Signed-off-by: Vaisakh Murali <mvaisakh@statixos.com> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2024-11-19 18:04:40 +01:00
Sultan Alsawaf	0b24a687cf	sched: Set sched_nr_migrate back to 32 on RT for Android Android isn't a real-time userspace and has lots of processes, which makes the normal sched_nr_migrate value of 32 more appealing. In addition, there's no observed latency reduction from using a sched_nr_migrate value of 8, probably because the shallowest idle state on mobile CPUs takes longer to enter/exit than it takes for the scheduler to do a load balance run, so our tail end latency is limited by cpuidle anyway.	2024-11-19 18:04:37 +01:00
Rafael J. Wysocki	bc903594c9	cpufreq: schedutil: Reduce frequencies slower The schedutil governor reduces frequencies too fast in some situations which causes undesirable performance drops to appear. To address that issue, make schedutil reduce the frequency slower by setting it to the average of the value chosen during the previous iteration of governor computations and the new one coming from its frequency selection formula. Link: https://bugzilla.kernel.org/show_bug.cgi?id=194963 Reported-by: John <john.ettedgui@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Cykeek <Cykeek@proton.me> Signed-off-by: negrroo <mohammedaelnaggar1@gmail.com> Signed-off-by: priiii1808 <priyanshusinghal0818@gmail.com>	2024-11-19 18:04:33 +01:00
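The averaging rule in the commit above amounts to the following sketch. `damped_next_freq()` is a hypothetical name, and the units (kHz) and call site are assumptions; the only behavior taken from the commit message is that a lower request is replaced by the average of the previous and the newly computed frequency.

```c
#include <assert.h>

/*
 * When the newly computed frequency is lower than the previous one,
 * move only halfway down; increases are applied unchanged.
 */
static unsigned int damped_next_freq(unsigned int prev_freq,
                                     unsigned int new_freq)
{
    if (new_freq < prev_freq)
        /* Same as (prev + new) / 2, written to avoid overflow. */
        return new_freq + (prev_freq - new_freq) / 2;
    return new_freq;
}
```

Repeated calls with a constant lower target converge on it geometrically, which is what makes the ramp-down slower than the ramp-up.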
Yaroslav Furman	04ccb84743	kernel: printk: suspend-resume stfu Signed-off-by: Yaroslav Furman <yaro330@gmail.com> Signed-off-by: Oktapra Amtono <oktapra.amtono@gmail.com> Signed-off-by: clarencelol <clarencekuiek@icloud.com> Signed-off-by: Anush02198 <Anush.4376@gmail.com> Signed-off-by: Divyanshu-Modi <divyan.m05@gmail.com> Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com> Signed-off-by: NotZeetaa <rodrigo2005contente@gmail.com> Signed-off-by: priiii1808 <priyanshusinghal0818@gmail.com>	2024-11-19 18:04:28 +01:00
Cyber Knight	471bfb0e50	kernel/cpu: Silence abundance of logspam We don't really need to know if the CPU is getting disabled or enabled on a production device. Signed-off-by: Cyber Knight <cyberknight755@gmail.com> Signed-off-by: priiii1808 <priyanshusinghal0818@gmail.com>	2024-11-19 18:04:25 +01:00
Juhyung Park	8c11745023	kernel/sys.c: implement custom uname override The uname system-call will return CONFIG_UNAME_OVERRIDE_STRING on struct new_utsname->release when a process with CONFIG_UNAME_OVERRIDE_TARGET included in its cmdline calls it. Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>	2024-11-19 17:55:01 +01:00
Sultan Alsawaf	d9e7f45cc4	arm64: Disable GENERIC_IRQ_EFFECTIVE_AFF_MASK The effective affinity mask causes a lot of bugs by virtue of many set_irq_affinity handlers only setting an effective affinity mask for an IRQ's parent but not the IRQ itself. Since this is a widespread issue that would require manual fixing on every different SoC, just disable the effective affinity mask altogether and use the first CPU in an affinity mask configured. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>	2024-11-19 17:54:22 +01:00
Sultan Alsawaf	07a5ef1eeb	qos: Don't disable interrupts while holding pm_qos_lock None of the pm_qos functions actually run in interrupt context; if some driver calls pm_qos_update_target in interrupt context then it's already broken. There's no need to disable interrupts while holding pm_qos_lock, so don't do it. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>	2024-11-19 17:53:07 +01:00
Nahuel Gómez	27fe6f89a2	kernel: sched: ems: drop usage of SCHED_FEAT We removed this. ../kernel/sched/ems/core.c:1370:23: error: use of undeclared identifier 'sched_feat_names' 1370 \| index = match_string(sched_feat_names, __SCHED_FEAT_NR, "TTWU_QUEUE"); \| ^ ../kernel/sched/ems/core.c:1370:41: error: use of undeclared identifier '__SCHED_FEAT_NR' 1370 \| index = match_string(sched_feat_names, __SCHED_FEAT_NR, "TTWU_QUEUE"); \| ^ ../kernel/sched/ems/core.c:1372:23: error: use of undeclared identifier 'sched_feat_keys' 1372 \| static_key_disable(&sched_feat_keys[index]); \| ^ ../kernel/sched/ems/core.c:1373:3: error: use of undeclared identifier 'sysctl_sched_features'; did you mean 'sysctl_sched_latency'? 1373 \| sysctl_sched_features &= ~(1UL << index); \| ^~~~~~~~~~~~~~~~~~~~~ \| sysctl_sched_latency ../include/linux/sched/sysctl.h:29:21: note: 'sysctl_sched_latency' declared here 29 \| extern unsigned int sysctl_sched_latency; \| ^ 4 errors generated. Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>	2024-11-19 17:52:14 +01:00
Minchan Kim	f19a9560cc	locking/rwlocks: introduce write_lock_nested In preparation for converting bit_spin_lock to rwlock in zsmalloc so that multiple writers of zspages can run at the same time but those zspages are supposed to be different zspage instance. Thus, it's not deadlock. This patch adds write_lock_nested to support the case for LOCKDEP. [minchan@kernel.org: fix write_lock_nested for RT] Link: https://lkml.kernel.org/r/YZfrMTAXV56HFWJY@google.com [bigeasy@linutronix.de: fixup write_lock_nested() implementation] Link: https://lkml.kernel.org/r/20211123170134.y6xb7pmpgdn4m3bn@linutronix.de Link: https://lkml.kernel.org/r/20211115185909.3949505-8-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-11-19 17:44:05 +01:00
Sultan Alsawaf	d4bbaf5715	sched/core: Forbid Unity-based games from changing their CPU affinity Unity-based games (such as Wild Rift) like to shoot themselves in the foot by setting a nonsense CPU affinity, restricting the game to a narrow set of CPU cores that it thinks are the "big" cores in a heterogeneous CPU. It assumes that CPUs only have two performance domains (clusters), and therefore royally mucks up games' CPU affinities on CPUs which have more than two performance domains. Check if a setaffinity target task is part of a Unity-based game and silently ignore the setaffinity request so that it can't sabotage itself. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>	2024-11-19 17:43:59 +01:00
Johannes Weiner	90c0c9aa4a	cgroup: rstat: punt root-level optimization to individual controllers Current users of the rstat code can source root-level statistics from the native counters of their respective subsystem, allowing them to forego aggregation at the root level. This optimization is currently implemented inside the generic rstat code, which doesn't track the root cgroup and doesn't invoke the subsystem flush callbacks on it. However, the memory controller cannot do this optimization, because cgroup1 breaks out memory specifically for the local level, including at the root level. In preparation for the memory controller switching to rstat, move the optimization from rstat core to the controllers. Afterwards, rstat will always track the root cgroup for changes and invoke the subsystem callbacks on it; and it's up to the subsystem to special-case and skip aggregation of the root cgroup if it can source this information through other, cheaper means. This is the case for the io controller and the cgroup base stats. In their respective flush callbacks, check whether the parent is the root cgroup, and if so, skip the unnecessary upward propagation. The extra cost of tracking the root cgroup is negligible: on stat changes, we actually remove a branch that checks for the root. The queueing for a flush touches only per-cpu data, and only the first stat change since a flush requires a (per-cpu) lock. 
Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit dc26532aed0ab25c0801a34640d1f3b9b9098a48) (cherry picked from commit 69da183fcd0112af130879a1c93113a941e2241b) (cherry picked from commit ddf1013871482b246147e71a04c865c1be5cf74d) (cherry picked from commit 30fcd52e18dd1d508b1b22f7c660ac22de734f67) (cherry picked from commit 19c9a1b9d9ae9a4f359deaf89101f9013254f43d) (cherry picked from commit 0b4286aea9bb0a6ea6acb723f8396e476044190b)	2024-11-19 17:40:21 +01:00
Nahuel Gómez	7059d8baa3	kernel: sysctl: add init protection to common mm-related nodes The protected nodes are: * dirty_ratio * dirty_background_ratio * dirty_bytes * dirty_background_bytes * dirty_expire_centisecs * dirty_writeback_centisecs * swappiness This approach is inspired by [1] and makes use of the node tampering blacklist. [1]: `239efdc263` Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>	2024-11-19 17:39:17 +01:00
Uladzislau Rezki	d0dc26b405	workqueue: Make queue_rcu_work() use call_rcu_flush() Earlier commits in this series allow battery-powered systems to build their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option. This Kconfig option causes call_rcu() to delay its callbacks in order to batch them. This means that a given RCU grace period covers more callbacks, thus reducing the number of grace periods, in turn reducing the amount of energy consumed, which increases battery lifetime which can be a very good thing. This is not a subtle effect: In some important use cases, the battery lifetime is increased by more than 10%. This CONFIG_RCU_LAZY=y option is available only for CPUs that offload callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y. Delaying callbacks is normally not a problem because most callbacks do nothing but free memory. If the system is short on memory, a shrinker will kick all currently queued lazy callbacks out of their laziness, thus freeing their memory in short order. Similarly, the rcu_barrier() function, which blocks until all currently queued callbacks are invoked, will also kick lazy callbacks, thus enabling rcu_barrier() to complete in a timely manner. However, there are some cases where laziness is not a good option. For example, synchronize_rcu() invokes call_rcu(), and blocks until the newly queued callback is invoked. It would not be a good for synchronize_rcu() to block for ten seconds, even on an idle system. Therefore, synchronize_rcu() invokes call_rcu_flush() instead of call_rcu(). The arrival of a non-lazy call_rcu_flush() callback on a given CPU kicks any lazy callbacks that might be already queued on that CPU. After all, if there is going to be a grace period, all callbacks might as well get full benefit from it. 
Yes, this could be done the other way around by creating a call_rcu_lazy(), but earlier experience with this approach and feedback at the 2022 Linux Plumbers Conference shifted the approach to call_rcu() being lazy with call_rcu_flush() for the few places where laziness is inappropriate. And another call_rcu() instance that cannot be lazy is the one in queue_rcu_work(), given that callers to queue_rcu_work() are not necessarily OK with long delays. Therefore, make queue_rcu_work() use call_rcu_flush() in order to revert to the old behavior. Signed-off-by: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2024-11-19 17:37:56 +01:00
Sultan Alsawaf	fa6b06bf46	sched/fair: Always update CPU capacity when load balancing Limiting CPU capacity updates, which are quite cheap, results in worse balancing decisions during opportunistic balancing (e.g., SD_BALANCE_WAKE). This causes opportunistic placement decisions to be skewed using stale CPU capacity data, and when a CPU isn't idling much, its capacity suffers from even more staleness since the only exception to the 100 ms capacity update ratelimit is a CPU exiting idle. Since the capacity updates are cheap, always do it when load balancing in order to improve opportunistic task placement decisions. Change-Id: If1d451ce742fd093010057e31e71012d47fad70a Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>	2024-11-19 17:34:49 +01:00
Joel Fernandes (Google)	323a4009a4	rcu: Avoid unnecessary softirq when system is idle When there are no callbacks pending on an idle system, I noticed that RCU softirq is continuously firing. During this the cpu_no_qs is set to false, and core_needs_qs is set to true indefinitely. This causes rcu_process_callbacks to be repeatedly called, even though the node corresponding to the CPU has that CPU's mask bit cleared and the system is idle. I believe the race is when such mask clearing is done during idle CPU scan of the quiescent state forcing stage in the kthread instead of the softirq. Since the rnp mask is cleared, but the flags on the CPU's rdp are not cleared, the CPU thinks it still needs to report to core RCU. Cure this by clearing the core_needs_qs flag when the CPU detects that its node is already updated which will avoid the unwanted softirq raises to the benefit of real-time systems. Test: Ran rcutorture for various tree RCU configs. Change-Id: Iee374d1dcdc74ecc5e6816a99be51feddd876931 Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: mydongistiny <jaysonedson@gmail.com>	2024-11-19 17:34:20 +01:00
Daniel Borkmann	33c88d138d	bpf: Fix overrunning reservations in ringbuf commit cfa1a2329a691ffd991fcf7248a57d752e712881 upstream. The BPF ring buffer internally is implemented as a power-of-2 sized circular buffer, with two logical and ever-increasing counters: consumer_pos is the consumer counter to show which logical position the consumer consumed the data, and producer_pos which is the producer counter denoting the amount of data reserved by all producers. Each time a record is reserved, the producer that "owns" the record will successfully advance producer counter. In user space each time a record is read, the consumer of the data advanced the consumer counter once it finished processing. Both counters are stored in separate pages so that from user space, the producer counter is read-only and the consumer counter is read-write. One aspect that simplifies and thus speeds up the implementation of both producers and consumers is how the data area is mapped twice contiguously back-to-back in the virtual memory, allowing to not take any special measures for samples that have to wrap around at the end of the circular buffer data area, because the next page after the last data page would be first data page again, and thus the sample will still appear completely contiguous in virtual memory. Each record has a struct bpf_ringbuf_hdr { u32 len; u32 pg_off; } header for book-keeping the length and offset, and is inaccessible to the BPF program. Helpers like bpf_ringbuf_reserve() return `(void *)hdr + BPF_RINGBUF_HDR_SZ` for the BPF program to use. Bing-Jhong and Muhammad reported that it is however possible to make a second allocated memory chunk overlapping with the first chunk and as a result, the BPF program is now able to edit first chunk's header. For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to bpf_ringbuf_reserve() is made. 
This will allocate a chunk A, which is in [0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, lets allocate a chunk B with size 0x3000. This will succeed because consumer_pos was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask` check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data pages. This means that chunk B at [0x4000,0x4008] is chunk A's header. bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header's pg_off to then locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk B modified chunk A's header, then bpf_ringbuf_commit() refers to the wrong page and could cause a crash. Fix it by calculating the oldest pending_pos and check whether the range from the oldest outstanding record to the newest would span beyond the ring buffer size. If that is the case, then reject the request. We've tested with the ring buffer benchmark in BPF selftests (./benchs/run_bench_ringbufs.sh) before/after the fix and while it seems a bit slower on some benchmarks, it is still not significantly enough to matter. 
Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg> Reported-by: Muhammad Ramdhan <ramdhan@starlabs.sg> Co-developed-by: Bing-Jhong Billy Jheng <billy@starlabs.sg> Co-developed-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Bing-Jhong Billy Jheng <billy@starlabs.sg> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240621140828.18238-1-daniel@iogearbox.net Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-19 14:19:51 +01:00
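The overlap scenario from the commit above can be replayed with a toy model. This is a sketch under stated assumptions, not the kernel implementation: `struct toy_ringbuf`, `toy_reserve()`, and the `check_pending` switch are hypothetical, and `pend_pos` simply stays at the oldest uncommitted record because nothing is ever committed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HDR_SZ 8u /* size of the per-record header, as in the commit text */

struct toy_ringbuf {
    uint64_t mask;     /* buffer size - 1 */
    uint64_t cons_pos; /* consumer counter, writable from user space */
    uint64_t prod_pos; /* producer counter */
    uint64_t pend_pos; /* oldest record not yet committed */
};

/* Returns true if the reservation is accepted. */
static bool toy_reserve(struct toy_ringbuf *rb, uint64_t size,
                        bool check_pending)
{
    uint64_t len = (size + HDR_SZ + 7) & ~7ull; /* round up to 8 bytes */
    uint64_t new_prod_pos = rb->prod_pos + len;

    /* Original check: trusts the consumer position. */
    if (new_prod_pos - rb->cons_pos > rb->mask)
        return false;
    /* Added check: span from oldest pending record must fit the buffer. */
    if (check_pending && new_prod_pos - rb->pend_pos > rb->mask)
        return false;

    rb->prod_pos = new_prod_pos;
    return true;
}
```

With a 0x4000-byte buffer and `cons_pos` forged to 0x3000, two 0x3000-byte reservations both pass the original check, while the pending-position check rejects the second one.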
Eduard Zingerman	12ebd1d34e	bpf: Allow reads from uninit stack commit 6715df8d5d24655b9fd368e904028112b54c7de1 upstream. This commit updates the following functions to allow reads from uninitialized stack locations when env->allow_uninit_stack option is enabled: - check_stack_read_fixed_off() - check_stack_range_initialized(), called from: - check_stack_read_var_off() - check_helper_mem_access() Such change allows to relax logic in stacksafe() to treat STACK_MISC and STACK_INVALID in a same way and make the following stack slot configurations equivalent: \| Cached state \| Current state \| \| stack slot \| stack slot \| \|------------------+------------------\| \| STACK_INVALID or \| STACK_INVALID or \| \| STACK_MISC \| STACK_SPILL or \| \| \| STACK_MISC or \| \| \| STACK_ZERO or \| \| \| STACK_DYNPTR \| This leads to significant verification speed gains (see below). The idea was suggested by Andrii Nakryiko [1] and initial patch was created by Alexei Starovoitov [2]. Currently the env->allow_uninit_stack is allowed for programs loaded by users with CAP_PERFMON or CAP_SYS_ADMIN capabilities. A number of test cases from verifier/*.c were expecting uninitialized stack access to be an error. These test cases were updated to execute in unprivileged mode (thus preserving the tests). The test progs/test_global_func10.c expected "invalid indirect read from stack" error message because of the access to uninitialized memory region. This error is no longer possible in privileged mode. The test is updated to provoke an error "invalid indirect access to stack" because of access to invalid stack address (such error is not verified by progs/test_global_func*.c series of tests). The following tests had to be removed because these can't be made unprivileged: - verifier/sock.c: - "sk_storage_get(map, skb->sk, &stack_value, 1): partially init stack_value" BPF_PROG_TYPE_SCHED_CLS programs are not executed in unprivileged mode. 
- verifier/var_off.c:
  - "indirect variable-offset stack access, max_off+size > max_initialized"
  - "indirect variable-offset stack access, uninitialized"
  These tests verify that access to uninitialized stack values is detected when stack offset is not a constant. However, variable stack access is prohibited in unprivileged mode, thus these tests are no longer valid.

* * *

Here is veristat log comparing this patch with current master on a set of selftest binaries listed in tools/testing/selftests/bpf/veristat.cfg and cilium BPF binaries (see [3]):

$ ./veristat -e file,prog,states -C -f 'states_pct<-30' master.log current.log
File                        Program                     States (A)  States (B)  States (DIFF)
--------------------------  --------------------------  ----------  ----------  ----------------
bpf_host.o                  tail_handle_ipv6_from_host         349         244    -105 (-30.09%)
bpf_host.o                  tail_handle_nat_fwd_ipv4          1320         895    -425 (-32.20%)
bpf_lxc.o                   tail_handle_nat_fwd_ipv4          1320         895    -425 (-32.20%)
bpf_sock.o                  cil_sock4_connect                   70          48     -22 (-31.43%)
bpf_sock.o                  cil_sock4_sendmsg                   68          46     -22 (-32.35%)
bpf_xdp.o                   tail_handle_nat_fwd_ipv4          1554         803    -751 (-48.33%)
bpf_xdp.o                   tail_lb_ipv4                      6457        2473   -3984 (-61.70%)
bpf_xdp.o                   tail_lb_ipv6                      7249        3908   -3341 (-46.09%)
pyperf600_bpf_loop.bpf.o    on_event                           287         145    -142 (-49.48%)
strobemeta.bpf.o            on_event                         15915        4772  -11143 (-70.02%)
strobemeta_nounroll2.bpf.o  on_event                         17087        3820  -13267 (-77.64%)
xdp_synproxy_kern.bpf.o     syncookie_tc                     21271        6635  -14636 (-68.81%)
xdp_synproxy_kern.bpf.o     syncookie_xdp                    23122        6024  -17098 (-73.95%)
--------------------------  --------------------------  ----------  ----------  ----------------

Note: I limited selection by states_pct<-30%.
Inspection of differences in pyperf600_bpf_loop behavior shows that the following patch for the test removes almost all differences:

  --- a/tools/testing/selftests/bpf/progs/pyperf.h
  +++ b/tools/testing/selftests/bpf/progs/pyperf.h
  @@ -266,8 +266,8 @@ int __on_event(struct bpf_raw_tracepoint_args *ctx)
          }
          if (event->pthread_match || !pidData->use_tls) {
  -               void* frame_ptr;
  -               FrameData frame;
  +               void* frame_ptr = 0;
  +               FrameData frame = {};
                  Symbol sym = {};
                  int cur_cpu = bpf_get_smp_processor_id();

W/o this patch the difference comes from the following pattern (for different variables):

  static bool get_frame_data(... FrameData *frame ...)
  {
          ...
          bpf_probe_read_user(&frame->f_code, ...);
          if (!frame->f_code)
                  return false;
          ...
          bpf_probe_read_user(&frame->co_name, ...);
          if (frame->co_name)
                  ...;
  }

  int __on_event(struct bpf_raw_tracepoint_args *ctx)
  {
          FrameData frame;
          ...
          get_frame_data(... &frame ...) // indirectly via a bpf_loop & callback
          ...
  }

  SEC("raw_tracepoint/kfree_skb")
  int on_event(struct bpf_raw_tracepoint_args* ctx)
  {
          ...
          ret |= __on_event(ctx);
          ret |= __on_event(ctx);
          ...
  }

With regards to value `frame->co_name` the following is important:
- Because of the conditional `if (!frame->f_code)` each call to __on_event() produces two states, one with `frame->co_name` marked as STACK_MISC, another with it as is (and marked STACK_INVALID on a first call).
- The call to bpf_probe_read_user() does not mark stack slots corresponding to `&frame->co_name` as REG_LIVE_WRITTEN but it marks these slots as STACK_MISC; this happens because of the following loop in check_helper_call():

  for (i = 0; i < meta.access_size; i++) {
          err = check_mem_access(env, insn_idx, meta.regno, i, BPF_B, BPF_WRITE, -1, false);
          if (err)
                  return err;
  }

  Note the size of the write: it is a one-byte write for each byte touched by a helper. The BPF_B write does not lead to write marks for the target stack slot.
- Which means that w/o this patch when second __on_event() call is verified `if (frame->co_name)` will propagate read marks first to a stack slot with STACK_MISC marks and second to a stack slot with STACK_INVALID marks and these states would be considered different. [1] https://lore.kernel.org/bpf/CAEf4BzY3e+ZuC6HUa8dCiUovQRg2SzEk7M-dSkqNZyn=xEmnPA@mail.gmail.com/ [2] https://lore.kernel.org/bpf/CAADnVQKs2i1iuZ5SUGuJtxWVfGYR9kDgYKhq3rNV+kBLQCu7rA@mail.gmail.com/ [3] git@github.com:anakryiko/cilium.git Suggested-by: Andrii Nakryiko <andrii@kernel.org> Co-developed-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230219200427.606541-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-19 14:19:46 +01:00
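The slot equivalence described in the commit message above can be modelled in a few lines of C. This is a minimal sketch, not the kernel's stacksafe(): the helper name `slot_equivalent` is hypothetical, the enum values are arbitrary, and the rule for all other slot types (require identical types) is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Slot types named in the commit message; numeric values are arbitrary here. */
enum slot_type {
	STACK_INVALID,
	STACK_SPILL,
	STACK_MISC,
	STACK_ZERO,
	STACK_DYNPTR,
};

/*
 * Hypothetical helper: may a cached verifier state whose slot has type
 * `cached` be reused for a current state whose slot has type `cur`?
 */
static bool slot_equivalent(enum slot_type cached, enum slot_type cur,
			    bool allow_uninit_stack)
{
	/* Assumption: a slot the cached state never initialized constrains nothing. */
	if (cached == STACK_INVALID)
		return true;
	/* Relaxation from the commit: treat STACK_MISC like STACK_INVALID. */
	if (allow_uninit_stack && cached == STACK_MISC)
		return true;
	/* Simplifying assumption: otherwise require identical slot types. */
	return cached == cur;
}
```

With the option enabled, a cached STACK_MISC slot no longer forces a separate state when the current slot is STACK_INVALID, which is the pattern the `frame->co_name` analysis describes.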
GUO Zihua	f645f11672	ima: Avoid blocking in RCU read-side critical section commit 9a95c5bfbf02a0a7f5983280fe284a0ff0836c34 upstream. A panic happens in ima_match_policy: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 PGD 42f873067 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 5 PID: 1286325 Comm: kubeletmonit.sh Kdump: loaded Tainted: P Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 RIP: 0010:ima_match_policy+0x84/0x450 Code: 49 89 fc 41 89 cf 31 ed 89 44 24 14 eb 1c 44 39 7b 18 74 26 41 83 ff 05 74 20 48 8b 1b 48 3b 1d f2 b9 f4 00 0f 84 9c 01 00 00 <44> 85 73 10 74 ea 44 8b 6b 14 41 f6 c5 01 75 d4 41 f6 c5 02 74 0f RSP: 0018:ff71570009e07a80 EFLAGS: 00010207 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000200 RDX: ffffffffad8dc7c0 RSI: 0000000024924925 RDI: ff3e27850dea2000 RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffffabfce739 R10: ff3e27810cc42400 R11: 0000000000000000 R12: ff3e2781825ef970 R13: 00000000ff3e2785 R14: 000000000000000c R15: 0000000000000001 FS: 00007f5195b51740(0000) GS:ff3e278b12d40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000010 CR3: 0000000626d24002 CR4: 0000000000361ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ima_get_action+0x22/0x30 process_measurement+0xb0/0x830 ? page_add_file_rmap+0x15/0x170 ? alloc_set_pte+0x269/0x4c0 ? prep_new_page+0x81/0x140 ? simple_xattr_get+0x75/0xa0 ? selinux_file_open+0x9d/0xf0 ima_file_check+0x64/0x90 path_openat+0x571/0x1720 do_filp_open+0x9b/0x110 ? page_counter_try_charge+0x57/0xc0 ? files_cgroup_alloc_fd+0x38/0x60 ? __alloc_fd+0xd4/0x250 ? 
do_sys_open+0x1bd/0x250 do_sys_open+0x1bd/0x250 do_syscall_64+0x5d/0x1d0 entry_SYSCALL_64_after_hwframe+0x65/0xca

Commit c7423dbdbc9e ("ima: Handle -ESTALE returned by ima_filter_rule_match()") introduced a call to ima_lsm_copy_rule within an RCU read-side critical section which contains kmalloc with GFP_KERNEL. This implies a possible sleep and violates limitations of RCU read-side critical sections on non-PREEMPT systems. Sleeping within an RCU read-side critical section might cause synchronize_rcu() to return early and break RCU protection, allowing a UAF to happen. The root cause of this issue could be described as follows:

| Thread A             | Thread B                |
|                      | ima_match_policy        |
|                      |   rcu_read_lock         |
| ima_lsm_update_rule  |                         |
|   synchronize_rcu    |                         |
|                      |     kmalloc(GFP_KERNEL) |
|                      |       sleep             |
==> synchronize_rcu returns early
|   kfree(entry)       |                         |
|                      |     entry = entry->next |
==> UAF happens and entry now becomes NULL (or could be anything).
|                      |     entry->action       |
==> Accessing entry might cause panic.

To fix this issue, we are converting all kmalloc that is called within RCU read-side critical section to use GFP_ATOMIC. Fixes: c7423dbdbc9e ("ima: Handle -ESTALE returned by ima_filter_rule_match()") Cc: stable@vger.kernel.org Signed-off-by: GUO Zihua <guozihua@huawei.com> Acked-by: John Johansen <john.johansen@canonical.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Reviewed-by: Casey Schaufler <casey@schaufler-ca.com> [PM: fixed missing comment, long lines, !CONFIG_IMA_LSM_RULES case] Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-19 14:19:42 +01:00
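The rule behind this fix (no allocation that may sleep while inside an RCU read-side critical section) can be illustrated with a small user-space toy model. Everything prefixed `toy_` or `TOY_` is hypothetical and only mimics the shape of the kernel calls; it is a sketch of the constraint under those assumptions, not the IMA code and not real RCU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for allocation flags: one may sleep, one may not. */
enum toy_gfp { TOY_GFP_KERNEL, TOY_GFP_ATOMIC };

static int toy_rcu_depth;        /* nesting depth of read-side sections */
static int toy_sleep_violations; /* sleeping allocations made inside one */

static void toy_rcu_read_lock(void)
{
	toy_rcu_depth++;
}

static void toy_rcu_read_unlock(void)
{
	assert(toy_rcu_depth > 0);
	toy_rcu_depth--;
}

/* Records a violation when a possibly-sleeping allocation is requested
 * while a read-side critical section is open. */
static void *toy_kmalloc(size_t size, enum toy_gfp flags)
{
	if (flags == TOY_GFP_KERNEL && toy_rcu_depth > 0)
		toy_sleep_violations++;
	return malloc(size);
}
```

In this model, allocating with `TOY_GFP_ATOMIC` between `toy_rcu_read_lock()` and `toy_rcu_read_unlock()` leaves the violation counter untouched, while `TOY_GFP_KERNEL` in the same position increments it, mirroring the GFP_KERNEL to GFP_ATOMIC conversion the commit describes.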
Jinliang Zheng	1b455ad8b9	mm: optimize the redundant loop of mm_update_owner_next() commit cf3f9a593dab87a032d2b6a6fb205e7f3de4f0a1 upstream. When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or /proc or ptrace or page migration (get_task_mm()), it is impossible to find an appropriate task_struct in the loop whose mm_struct is the same as the target mm_struct. If the above race condition is combined with the stress-ng-zombie and stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in write_lock_irq() for tasklist_lock. Recognize this situation in advance and exit early. Link: https://lkml.kernel.org/r/20240620122123.3877432-1-alexjlzheng@tencent.com Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tycho Andersen <tandersen@netflix.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-19 14:19:41 +01:00
Arnd Bergmann	e0221b0a4a	syscalls: fix compat_sys_io_pgetevents_time64 usage commit d3882564a77c21eb746ba5364f3fa89b88de3d61 upstream. Using sys_io_pgetevents() as the entry point for compat mode tasks works almost correctly, but misses the sign extension for the min_nr and nr arguments. This was addressed on parisc by switching to compat_sys_io_pgetevents_time64() in commit 6431e92fc827 ("parisc: io_pgetevents_time64() needs compat syscall in 32-bit compat mode"), as well as by using more sophisticated system call wrappers on x86 and s390. However, arm64, mips, powerpc, sparc and riscv still have the same bug. Change all of them over to use compat_sys_io_pgetevents_time64() like parisc already does. This was clearly the intention when the function was originally added, but it got hooked up incorrectly in the tables. Cc: stable@vger.kernel.org Fixes: 48166e6ea47d ("y2038: add 64-bit time_t syscalls to all 32-bit architectures") Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-19 14:19:34 +01:00
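The missing sign extension mentioned above can be shown in plain C. This is an illustrative sketch: the helper names are hypothetical, and `min_nr_is_valid` is only an assumed stand-in for a range check that a native 64-bit entry point might apply to its long arguments.

```c
#include <assert.h>
#include <stdint.h>

/* A 32-bit argument placed in a 64-bit register without sign extension. */
static int64_t zero_extended(int32_t arg)
{
	return (int64_t)(uint64_t)(uint32_t)arg;
}

/* The same argument after the sign extension a compat wrapper performs. */
static int64_t sign_extended(int32_t arg)
{
	return (int64_t)arg;
}

/* Hypothetical range check on 64-bit arguments. */
static int min_nr_is_valid(int64_t min_nr, int64_t nr)
{
	return min_nr >= 0 && min_nr <= nr;
}
```

Under these assumptions, a 32-bit task passing -1 for both arguments is rejected when the values are sign-extended, but accepted when they are zero-extended, because each -1 then reads as 4294967295.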
Haifeng Xu	c52e3af387	perf/core: Fix missing wakeup when waiting for context reference [ Upstream commit 74751ef5c1912ebd3e65c3b65f45587e05ce5d36 ] In our production environment, we found many hung tasks which are blocked for more than 18 hours. Their call traces are like this:

[346278.191038] __schedule+0x2d8/0x890
[346278.191046] schedule+0x4e/0xb0
[346278.191049] perf_event_free_task+0x220/0x270
[346278.191056] ? init_wait_var_entry+0x50/0x50
[346278.191060] copy_process+0x663/0x18d0
[346278.191068] kernel_clone+0x9d/0x3d0
[346278.191072] __do_sys_clone+0x5d/0x80
[346278.191076] __x64_sys_clone+0x25/0x30
[346278.191079] do_syscall_64+0x5c/0xc0
[346278.191083] ? syscall_exit_to_user_mode+0x27/0x50
[346278.191086] ? do_syscall_64+0x69/0xc0
[346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20
[346278.191092] ? irqentry_exit+0x19/0x30
[346278.191095] ? exc_page_fault+0x89/0x160
[346278.191097] ? asm_exc_page_fault+0x8/0x30
[346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae

The task was waiting for the refcount to become 1, but from the vmcore, we found the refcount was already 1. It seems that the task didn't get woken up by perf_event_release_kernel() and got stuck forever. The scenario below may cause the problem.

Thread A                                Thread B
...                                     ...
perf_event_free_task                    perf_event_release_kernel
                                          ...
                                          acquire event->child_mutex
                                          ...
                                          get_ctx
  ...                                     release event->child_mutex
  acquire ctx->mutex
  ...
  perf_free_event (acquire/release event->child_mutex)
  ...
  release ctx->mutex
  wait_var_event
                                          acquire ctx->mutex
                                          acquire event->child_mutex
                                          # move existing events to free_list
                                          release event->child_mutex
                                          release ctx->mutex
                                          put_ctx
...                                     ...

In this case, all events of the ctx have been freed, so we couldn't find the ctx in free_list and Thread A will miss the wakeup. It's thus necessary to add a wakeup after dropping the reference. 
Fixes: 1cf8dfe8a661 ("perf/core: Fix race between close() and fork()") Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20240513103948.33570-1-haifeng.xu@shopee.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-19 14:19:30 +01:00
Matthias Maennich	4070b454f4	kheaders: explicitly define file modes for archived headers [ Upstream commit 3bd27a847a3a4827a948387cc8f0dbc9fa5931d5 ] Build environments might be running with different umask settings resulting in indeterministic file modes for the files contained in kheaders.tar.xz. The file itself is served with 444, i.e. world readable. Archive the files explicitly with 744,a+X to improve reproducibility across build environments. --mode=0444 is not suitable as directories need to be executable. Also, 444 makes it hard to delete all the readonly files after extraction. Cc: stable@vger.kernel.org Signed-off-by: Matthias Maennich <maennich@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-19 14:19:30 +01:00
Masahiro Yamada	bd0a2fbc37	Revert "kheaders: substituting --sort in archive creation" [ Upstream commit 49c386ebbb43394ff4773ce24f726f6afc4c30c8 ] This reverts commit 700dea5a0bea9f64eba89fae7cb2540326fdfdc1. The reason for that commit was --sort=ORDER introduced in tar 1.28 (2014). More than 3 years have passed since then. Requiring GNU tar 1.28 should be fine now because we require GCC 5.1 (2015). Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu> Stable-dep-of: 3bd27a847a3a ("kheaders: explicitly define file modes for archived headers") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-19 14:19:29 +01:00
Jeff Johnson	119c3632ba	tracing: Add MODULE_DESCRIPTION() to preemptirq_delay_test [ Upstream commit 23748e3e0fbfe471eff5ce439921629f6a427828 ] Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/trace/preemptirq_delay_test.o Link: https://lore.kernel.org/linux-trace-kernel/20240518-md-preemptirq_delay_test-v1-1-387d11b30d85@quicinc.com Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: f96e8577da10 ("lib: Add module for testing preemptoff/irqsoff latency tracers") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-19 14:19:29 +01:00

230 commits