kernel_samsung_a53x

Author	SHA1	Message	Date
Ksawlii	f6ee50f616	Reapply "sched: core: Minimize number of tasks to load balance" This reverts commit `d03ac752be`.	2024-12-18 11:24:44 +01:00
Ksawlii	12ce49e196	Reapply "timekeeping: Keep the tick alive when CPUs cycle out of s2idle" This reverts commit `0a04e29636`.	2024-12-18 11:24:12 +01:00
Ksawlii	febb7ecbd1	Revert "sysctl: promote several nodes out of CONFIG_SCHED_DEBUG" This reverts commit `26944181d5`.	2024-12-18 11:09:28 +01:00
Ksawlii	8ce8f6324e	Revert "fs,kernel,mm: tune to Ktweak balance" This reverts commit `50e7a3b302`.	2024-12-18 11:09:11 +01:00
Ksawlii	48c33cf21c	Revert "sched: core: Minimize number of tasks to load balance" This reverts commit `e983caf81d`.	2024-12-18 11:07:27 +01:00
Ksawlii	58e2d66f91	Revert "fs,kernel,mm: tune to Ktweak balance" This reverts commit `50e7a3b302`.	2024-12-18 09:40:48 +01:00
Ksawlii	0a04e29636	Revert "timekeeping: Keep the tick alive when CPUs cycle out of s2idle" This reverts commit `3af8699aee`.	2024-12-18 09:38:06 +01:00
Ksawlii	d03ac752be	Revert "sched: core: Minimize number of tasks to load balance" This reverts commit `e983caf81d`.	2024-12-18 00:32:20 +01:00
Ksawlii	f37c1e4577	Revert "smp: Use migrate disable/enable in smp_call_function_single_async()" This reverts commit `b6764b2064`.	2024-12-18 00:32:11 +01:00
Ksawlii	613c9dd5f1	Revert "rcu: Make the grace period workers unbound again" This reverts commit `27bb889ce3`.	2024-12-18 00:32:04 +01:00
Ksawlii	ea96a0db96	Revert "sysctl: promote several nodes out of CONFIG_SCHED_DEBUG" This reverts commit `26944181d5`.	2024-12-18 00:17:01 +01:00
Ksawlii	cac147d1b1	Revert "PM / freezer: Reduce freeze timeout to 1 second for Android" This reverts commit `a163b49c80`.	2024-12-17 23:31:21 +01:00
Ksawlii	825fa0d2ea	Revert "sched/fair: apply init protection" This reverts commit `1811b5b3e5`.	2024-12-17 21:14:50 +01:00
Ksawlii	2023e9d3a9	kernel: core.c: Forgot to remove HEAD	2024-12-17 21:11:04 +01:00
Ksawlii	15afa770f5	Revert "fs,kernel,mm: tune to Ktweak balance" This reverts commit `50e7a3b302`.	2024-12-17 20:58:55 +01:00
Tyler Nijmeh	e983caf81d	sched: core: Minimize number of tasks to load balance We don't want sched_nr_migrate to be too high, as that would impact real-time latencies as the load balancing process for SCHED_OTHER disallows IRQ interrupts. Instead of making this a constant value, let's use an amount that relates to the CPU performance of a device. Consider that a device with a very weak 4 core processor needs to load balance 32 tasks from the most busy-but-idle CPU. That balancing would take significantly longer than a device with 8 cores. On the contrary view, that same 8 core device balancing 32 tasks would have no issues with SCHED_OTHER performance, however, real-time tasks would suffer more jitter than is necessary. Let's only balance as many tasks as there are CPUs in a device for optimal SCHED_OTHER performance and SCHED_FIFO / SCHED_RR latency.	2024-12-17 20:32:22 +01:00
Sultan Alsawaf	a163b49c80	PM / freezer: Reduce freeze timeout to 1 second for Android Freezing processes on Android usually takes less than 100 ms, and if it takes longer than that to the point where the 20 second freeze timeout is reached, it's because the remaining processes to be frozen are deadlocked waiting for something from a process which is already frozen. There's no point in burning power trying to freeze for that long, so reduce the freeze timeout to a very generous 1 second for Android and don't let anything mess with it. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Panchajanya1999 <kernel@panchajanya.dev> (cherry picked from commit 72fc44bd8bcfb6d1cbaafb2053173cd985d81d0c) (cherry picked from commit 7097f6cabd228cf426da77ee71088e56beba3cd0) (cherry picked from commit 4f1c2d26818d846c0aea6713652e2a0eaa6a23c4) (cherry picked from commit d416b35f8187c0605ac8816e4d771288514876f1)	2024-12-17 20:32:18 +01:00
Sultan Alsawaf	3af8699aee	timekeeping: Keep the tick alive when CPUs cycle out of s2idle When some CPUs cycle out of s2idle due to non-wakeup IRQs, it's possible for them to run while the CPU responsible for jiffies updates remains idle. This can delay the execution of timers indefinitely until the CPU managing the jiffies updates finally wakes up, by which point everything could be dead if enough time passes. Fix it by handing off timekeeping duties when the timekeeping CPU enters s2idle and freezes its tick. When all CPUs are in s2idle, the first one to wake up for any reason (either from a wakeup IRQ or non-wakeup IRQ) will assume responsibility for the timekeeping tick. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Panchajanya1999 <kernel@panchajanya.dev> (cherry picked from commit 717100653a78c63fe56b95721fffee5fad96ee91) (cherry picked from commit 83196f829f9e2d4aef6a4d30ac449a7bfc985208) (cherry picked from commit ab251b18b95c2fe80f457dabdf2e7132a6b0ea27) (cherry picked from commit e58369f386e19f02bc5db1a155040e54dd201a60)	2024-12-17 20:32:12 +01:00
Sultan Alsawaf	b6764b2064	smp: Use migrate disable/enable in smp_call_function_single_async() Asynchronous IPI users must already handle csd object lifetimes on their own, so there's no need to prevent re-entrancy on a single CPU inside smp_call_function_single_async(). As such, smp_call_function_single_async() can be made more RT-friendly by just using migrate enable/disable instead of disabling preemption, since preventing migration is all that's needed. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Panchajanya1999 <kernel@panchajanya.dev> (cherry picked from commit 9983455a67226511e036f19f8725cd54bac10aa1) (cherry picked from commit a792f43200afd6594a1e07b3cb11d048a9ec1218) (cherry picked from commit f2d0b87c64e5a1e7b73430950f92d7a64d39c964) (cherry picked from commit 0076a73f19418f3f7867b6ce2fa93f46105458c5)	2024-12-17 20:32:08 +01:00
Sultan Alsawaf	27bb889ce3	rcu: Make the grace period workers unbound again After a dedicated grace-period workqueue was added to RCU in order to benefit from rescuer threads, the relevant workers were moved to the new workqueue away from system_power_efficient_wq. The old workqueue was unbound, which is desirable for performance reasons. Making the workers bound measurably regressed performance, so make them unbound again. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Panchajanya1999 <kernel@panchajanya.dev> (cherry picked from commit 139be6da6ee92c8b0aa2d19081fa230435082054) (cherry picked from commit 3428cd70087fe5fd538d96e89f8f12266aacc2d1) (cherry picked from commit b4e1beb3e73ccd6f341f95020becfff0b7139745) (cherry picked from commit 428f05ee3318ed4314317bb697abec26a4bcc930)	2024-12-17 20:31:57 +01:00
Nahuel Gómez	50e7a3b302	fs,kernel,mm: tune to Ktweak balance Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>	2024-12-17 20:30:06 +01:00
Nahuel Gómez	1811b5b3e5	sched/fair: apply init protection Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>	2024-12-17 20:28:06 +01:00
Park Ju Hyung	26944181d5	sysctl: promote several nodes out of CONFIG_SCHED_DEBUG These are used in Android. Promote these to disable CONFIG_SCHED_DEBUG. Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com> [0ctobot: Adapted for 4.19] Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Change-Id: I8053176882e155926769939de15da375e7d548a0	2024-12-17 20:27:04 +01:00
K Prateek Nayak	e6fdbec5ef	sched/core: Prevent wakeup of ksoftirqd during idle load balance [ Upstream commit e932c4ab38f072ce5894b2851fea8bc5754bb8e5 ] Scheduler raises a SCHED_SOFTIRQ to trigger a load balancing event on from the IPI handler on the idle CPU. If the SMP function is invoked from an idle CPU via flush_smp_call_function_queue() then the HARD-IRQ flag is not set and raise_softirq_irqoff() needlessly wakes ksoftirqd because soft interrupts are handled before ksoftirqd get on the CPU. Adding a trace_printk() in nohz_csd_func() at the spot of raising SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup, and softirq_entry (for SCHED_SOFTIRQ vector alone) helps observing the current behavior: <idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ from nohz_csd_func <idle>-0 [000] dN.4.: sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000 <idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED] <idle>-0 [000] .Ns1.: softirq_exit: vec=7 [action=SCHED] <idle>-0 [000] d..2.: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=16 next_prio=120 ksoftirqd/0-16 [000] d..2.: sched_switch: prev_comm=ksoftirqd/0 prev_pid=16 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120 ... Use __raise_softirq_irqoff() to raise the softirq. The SMP function call is always invoked on the requested CPU in an interrupt handler. It is guaranteed that soft interrupts are handled at the end. Following are the observations with the changes when enabling the same set of events: <idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ for nohz_idle_balance <idle>-0 [000] dN.1.: softirq_raise: vec=7 [action=SCHED] <idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED] No unnecessary ksoftirqd wakeups are seen from idle task's context to service the softirq. Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") Closes: https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/ [1] Reported-by: Julia Lawall <julia.lawall@inria.fr> Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20241119054432.6405-5-kprateek.nayak@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:33 +01:00
K Prateek Nayak	d4d43810f2	sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy [ Upstream commit ff47a0acfcce309cf9e175149c75614491953c8f ] Commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") optimizes IPIs to idle CPUs in TIF_POLLING_NRFLAG mode by setting the TIF_NEED_RESCHED flag in idle task's thread info and relying on flush_smp_call_function_queue() in idle exit path to run the call-function. A softirq raised by the call-function is handled shortly after in do_softirq_post_smp_call_flush() but the TIF_NEED_RESCHED flag remains set and is only cleared later when schedule_idle() calls __schedule(). need_resched() check in _nohz_idle_balance() exists to bail out of load balancing if another task has woken up on the CPU currently in-charge of idle load balancing which is being processed in SCHED_SOFTIRQ context. Since the optimization mentioned above overloads the interpretation of TIF_NEED_RESCHED, check for idle_cpu() before going with the existing need_resched() check which can catch a genuine task wakeup on an idle CPU processing SCHED_SOFTIRQ from do_softirq_post_smp_call_flush(), as well as the case where ksoftirqd needs to be preempted as a result of new task wakeup or slice expiry. In case of PREEMPT_RT or threadirqs, although the idle load balancing may be inhibited in some cases on the ilb CPU, the fact that ksoftirqd is the only fair task going back to sleep will trigger a newidle balance on the CPU which will alleviate some imbalance if it exists if idle balance fails to do so. Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241119054432.6405-4-kprateek.nayak@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:33 +01:00
Valentin Schneider	8a8ef40c42	sched/fair: Add NOHZ balancer flag for nohz.next_balance updates [ Upstream commit efd984c481abb516fab8bafb25bf41fd9397a43c ] A following patch will trigger NOHZ idle balances as a means to update nohz.next_balance. Vincent noted that blocked load updates can have non-negligible overhead, which should be avoided if the intent is to only update nohz.next_balance. Add a new NOHZ balance kick flag, NOHZ_NEXT_KICK. Gate NOHZ blocked load update by the presence of NOHZ_STATS_KICK - currently all NOHZ balance kicks will have the NOHZ_STATS_KICK flag set, so no change in behaviour is expected. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210823111700.2842997-2-valentin.schneider@arm.com Stable-dep-of: ff47a0acfcce ("sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:33 +01:00
Vincent Guittot	ab620a407a	sched/fair: Trigger the update of blocked load on newly idle cpu [ Upstream commit c6f886546cb8a38617cdbe755fe50d3acd2463e4 ] Instead of waking up a random and already idle CPU, we can take advantage of this_cpu being about to enter idle to run the ILB and update the blocked load. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210224133007.28644-7-vincent.guittot@linaro.org Stable-dep-of: ff47a0acfcce ("sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:33 +01:00
Vincent Guittot	9ae9714a14	sched/fair: Merge for each idle cpu loop of ILB [ Upstream commit 7a82e5f52a3506bc35a4dc04d53ad2c9daf82e7f ] Remove the specific case for handling this_cpu outside for_each_cpu() loop when running ILB. Instead we use for_each_cpu_wrap() and start with the next cpu after this_cpu so we will continue to finish with this_cpu. update_nohz_stats() is now used for this_cpu too and will prevents unnecessary update. We don't need a special case for handling the update of nohz.next_balance for this_cpu anymore because it is now handled by the loop like others. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210224133007.28644-5-vincent.guittot@linaro.org Stable-dep-of: ff47a0acfcce ("sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:33 +01:00
Vincent Guittot	4ae526c326	sched/fair: Remove unused parameter of update_nohz_stats [ Upstream commit 64f84f273592d17dcdca20244168ad9f525a39c3 ] idle load balance is the only user of update_nohz_stats and doesn't use force parameter. Remove it Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210224133007.28644-4-vincent.guittot@linaro.org Stable-dep-of: ff47a0acfcce ("sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:33 +01:00
Vincent Guittot	fe0cdb4e3f	sched/fair: Remove update of blocked load from newidle_balance [ Upstream commit 0826530de3cbdc89e60a89e86def94a5f0fc81ca ] newidle_balance runs with both preempt and irq disabled which prevent local irq to run during this period. The duration for updating the blocked load of CPUs varies according to the number of CPU cgroups with non-decayed load and extends this critical period to an uncontrolled level. Remove the update from newidle_balance and trigger a normal ILB that will take care of the update instead. This reduces the IRQ latency from O(nr_cgroups * nr_nohz_cpus) to O(nr_cgroups). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210224133007.28644-2-vincent.guittot@linaro.org Stable-dep-of: ff47a0acfcce ("sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:33 +01:00
K Prateek Nayak	d08554baac	sched/core: Remove the unnecessary need_resched() check in nohz_csd_func() [ Upstream commit ea9cffc0a154124821531991d5afdd7e8b20d7aa ] The need_resched() check currently in nohz_csd_func() can be tracked to have been added in scheduler_ipi() back in 2011 via commit ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance") Since then, it has travelled quite a bit but it seems like an idle_cpu() check currently is sufficient to detect the need to bail out from an idle load balancing. To justify this removal, consider all the following case where an idle load balancing could race with a task wakeup: o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle") a target perceived to be idle (target_rq->nr_running == 0) will return true for ttwu_queue_cond(target) which will offload the task wakeup to the idle target via an IPI. In all such cases target_rq->ttwu_pending will be set to 1 before queuing the wake function. If an idle load balance races here, following scenarios are possible: - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual IPI is sent to the CPU to wake it out of idle. If the nohz_csd_func() queues before sched_ttwu_pending(), the idle load balance will bail out since idle_cpu(target) returns 0 since target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after sched_ttwu_pending() it should see rq->nr_running to be non-zero and bail out of idle load balancing. - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI, the sender will simply set TIF_NEED_RESCHED for the target to put it out of idle and flush_smp_call_function_queue() in do_idle() will execute the call function. Depending on the ordering of the queuing of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in nohz_csd_func() should either see target_rq->ttwu_pending = 1 or target_rq->nr_running to be non-zero if there is a genuine task wakeup racing with the idle load balance kick. o The waker CPU perceives the target CPU to be busy (targer_rq->nr_running != 0) but the CPU is in fact going idle and due to a series of unfortunate events, the system reaches a case where the waker CPU decides to perform the wakeup by itself in ttwu_queue() on the target CPU but target is concurrently selected for idle load balance (XXX: Can this happen? I'm not sure, but we'll consider the mother of all coincidences to estimate the worst case scenario). ttwu_do_activate() calls enqueue_task() which would increment "rq->nr_running" post which it calls wakeup_preempt() which is responsible for setting TIF_NEED_RESCHED (via a resched IPI or by setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key thing to note in this case is that rq->nr_running is already non-zero in case of a wakeup before TIF_NEED_RESCHED is set which would lead to idle_cpu() check returning false. In all cases, it seems that need_resched() check is unnecessary when checking for idle_cpu() first since an impending wakeup racing with idle load balancer will either set the "rq->ttwu_pending" or indicate a newly woken task via "rq->nr_running". Chasing the reason why this check might have existed in the first place, I came across Peter's suggestion on the fist iteration of Suresh's patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was: sched_ttwu_do_pending(list); if (unlikely((rq->idle == current) && rq->nohz_balance_kick && !need_resched())) raise_softirq_irqoff(SCHED_SOFTIRQ); Since the condition to raise the SCHED_SOFIRQ was preceded by sched_ttwu_do_pending() (which is equivalent of sched_ttwu_pending()) in the current upstream kernel, the need_resched() check was necessary to catch a newly queued task. Peter suggested modifying it to: if (idle_cpu() && rq->nohz_balance_kick && !need_resched()) raise_softirq_irqoff(SCHED_SOFTIRQ); where idle_cpu() seems to have replaced "rq->idle == current" check. Even back then, the idle_cpu() check would have been sufficient to catch a new task being enqueued. Since commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") overloads the interpretation of TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based on Peter's suggestion. Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241119054432.6405-3-kprateek.nayak@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:33 +01:00
Uros Bizjak	7870fff3b6	tracing: Use atomic64_inc_return() in trace_clock_counter() [ Upstream commit eb887c4567d1b0e7684c026fe7df44afa96589e6 ] Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref) to use optimized implementation and ease register pressure around the primitive for targets that implement optimized variant. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241007085651.48544-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:31 +01:00
Levi Yun	e992d99dcd	dma-debug: fix a possible deadlock on radix_lock [ Upstream commit 7543c3e3b9b88212fcd0aaf5cab5588797bdc7de ] radix_lock() shouldn't be held while holding dma_hash_entry[idx].lock otherwise, there's a possible deadlock scenario when dma debug API is called holding rq_lock(): CPU0 CPU1 CPU2 dma_free_attrs() check_unmap() add_dma_entry() __schedule() //out (A) rq_lock() get_hash_bucket() (A) dma_entry_hash check_sync() (A) radix_lock() (W) dma_entry_hash dma_entry_free() (W) radix_lock() // CPU2's one (W) rq_lock() CPU1 situation can happen when it extending radix tree and it tries to wake up kswapd via wake_all_kswapd(). CPU2 situation can happen while perf_event_task_sched_out() (i.e. dma sync operation is called while deleting perf_event using etm and etr tmc which are Arm Coresight hwtracing driver backends). To remove this possible situation, call dma_entry_free() after put_hash_bucket() in check_unmap(). Reported-by: Denis Nikitin <denik@chromium.org> Closes: https://lists.linaro.org/archives/list/coresight@lists.linaro.org/thread/2WMS7BBSF5OZYB63VT44U5YWLFP5HL6U/#RWM6MLQX5ANBTEQ2PRM7OXCBGCE6NPWU Signed-off-by: Levi Yun <yeoreum.yun@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:30 +01:00
Marco Elver	148c1e84e1	kcsan: Turn report_filterlist_lock into a raw_spinlock [ Upstream commit 59458fa4ddb47e7891c61b4a928d13d5f5b00aa0 ] Ran Xiaokai reports that with a KCSAN-enabled PREEMPT_RT kernel, we can see splats like: \| BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 \| in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1 \| preempt_count: 10002, expected: 0 \| RCU nest depth: 0, expected: 0 \| no locks held by swapper/1/0. \| irq event stamp: 156674 \| hardirqs last enabled at (156673): [<ffffffff81130bd9>] do_idle+0x1f9/0x240 \| hardirqs last disabled at (156674): [<ffffffff82254f84>] sysvec_apic_timer_interrupt+0x14/0xc0 \| softirqs last enabled at (0): [<ffffffff81099f47>] copy_process+0xfc7/0x4b60 \| softirqs last disabled at (0): [<0000000000000000>] 0x0 \| Preemption disabled at: \| [<ffffffff814a3e2a>] paint_ptr+0x2a/0x90 \| CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.11.0+ #3 \| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 \| Call Trace: \| <IRQ> \| dump_stack_lvl+0x7e/0xc0 \| dump_stack+0x1d/0x30 \| __might_resched+0x1a2/0x270 \| rt_spin_lock+0x68/0x170 \| kcsan_skip_report_debugfs+0x43/0xe0 \| print_report+0xb5/0x590 \| kcsan_report_known_origin+0x1b1/0x1d0 \| kcsan_setup_watchpoint+0x348/0x650 \| __tsan_unaligned_write1+0x16d/0x1d0 \| hrtimer_interrupt+0x3d6/0x430 \| __sysvec_apic_timer_interrupt+0xe8/0x3a0 \| sysvec_apic_timer_interrupt+0x97/0xc0 \| </IRQ> On a detected data race, KCSAN's reporting logic checks if it should filter the report. That list is protected by the report_filterlist_lock non-raw spinlock which may sleep on RT kernels. Since KCSAN may report data races in any context, convert it to a raw_spinlock. This requires being careful about when to allocate memory for the filter list itself which can be done via KCSAN's debugfs interface. Concurrent modification of the filter list via debugfs should be rare: the chosen strategy is to optimistically pre-allocate memory before the critical section and discard if unused. Link: https://lore.kernel.org/all/20240925143154.2322926-1-ranxiaokai627@163.com/ Reported-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Tested-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:29 +01:00
Maciej Fijalkowski	7d4a3d040e	bpf: fix OOB devmap writes when deleting elements commit ab244dd7cf4c291f82faacdc50b45cc0f55b674d upstream. Jordy reported issue against XSKMAP which also applies to DEVMAP - the index used for accessing map entry, due to being a signed integer, causes the OOB writes. Fix is simple as changing the type from int to u32, however, when compared to XSKMAP case, one more thing needs to be addressed. When map is released from system via dev_map_free(), we iterate through all of the entries and an iterator variable is also an int, which implies OOB accesses. Again, change it to be u32. Example splat below: [ 160.724676] BUG: unable to handle page fault for address: ffffc8fc2c001000 [ 160.731662] #PF: supervisor read access in kernel mode [ 160.736876] #PF: error_code(0x0000) - not-present page [ 160.742095] PGD 0 P4D 0 [ 160.744678] Oops: Oops: 0000 [#1] PREEMPT SMP [ 160.749106] CPU: 1 UID: 0 PID: 520 Comm: kworker/u145:12 Not tainted 6.12.0-rc1+ #487 [ 160.757050] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019 [ 160.767642] Workqueue: events_unbound bpf_map_free_deferred [ 160.773308] RIP: 0010:dev_map_free+0x77/0x170 [ 160.777735] Code: 00 e8 fd 91 ed ff e8 b8 73 ed ff 41 83 7d 18 19 74 6e 41 8b 45 24 49 8b bd f8 00 00 00 31 db 85 c0 74 48 48 63 c3 48 8d 04 c7 <48> 8b 28 48 85 ed 74 30 48 8b 7d 18 48 85 ff 74 05 e8 b3 52 fa ff [ 160.796777] RSP: 0018:ffffc9000ee1fe38 EFLAGS: 00010202 [ 160.802086] RAX: ffffc8fc2c001000 RBX: 0000000080000000 RCX: 0000000000000024 [ 160.809331] RDX: 0000000000000000 RSI: 0000000000000024 RDI: ffffc9002c001000 [ 160.816576] RBP: 0000000000000000 R08: 0000000000000023 R09: 0000000000000001 [ 160.823823] R10: 0000000000000001 R11: 00000000000ee6b2 R12: dead000000000122 [ 160.831066] R13: ffff88810c928e00 R14: ffff8881002df405 R15: 0000000000000000 [ 160.838310] FS: 0000000000000000(0000) GS:ffff8897e0c40000(0000) knlGS:0000000000000000 [ 160.846528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 160.852357] CR2: ffffc8fc2c001000 CR3: 0000000005c32006 CR4: 00000000007726f0 [ 160.859604] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 160.866847] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 160.874092] PKRU: 55555554 [ 160.876847] Call Trace: [ 160.879338] <TASK> [ 160.881477] ? __die+0x20/0x60 [ 160.884586] ? page_fault_oops+0x15a/0x450 [ 160.888746] ? search_extable+0x22/0x30 [ 160.892647] ? search_bpf_extables+0x5f/0x80 [ 160.896988] ? exc_page_fault+0xa9/0x140 [ 160.900973] ? asm_exc_page_fault+0x22/0x30 [ 160.905232] ? dev_map_free+0x77/0x170 [ 160.909043] ? dev_map_free+0x58/0x170 [ 160.912857] bpf_map_free_deferred+0x51/0x90 [ 160.917196] process_one_work+0x142/0x370 [ 160.921272] worker_thread+0x29e/0x3b0 [ 160.925082] ? rescuer_thread+0x4b0/0x4b0 [ 160.929157] kthread+0xd4/0x110 [ 160.932355] ? kthread_park+0x80/0x80 [ 160.936079] ret_from_fork+0x2d/0x50 [ 160.943396] ? kthread_park+0x80/0x80 [ 160.950803] ret_from_fork_asm+0x11/0x20 [ 160.958482] </TASK> Fixes: 546ac1ffb70d ("bpf: add devmap, a map for storing net device references") CC: stable@vger.kernel.org Reported-by: Jordy Zomer <jordyzomer@google.com> Suggested-by: Jordy Zomer <jordyzomer@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20241122121030.716788-3-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-17 13:24:29 +01:00
Kuan-Wei Chiu	c31c8dd62f	tracing: Fix cmp_entries_dup() to respect sort() comparison rules commit e63fbd5f6810ed756bbb8a1549c7d4132968baa9 upstream. The cmp_entries_dup() function used as the comparator for sort() violated the symmetry and transitivity properties required by the sorting algorithm. Specifically, it returned 1 whenever memcmp() was non-zero, which broke the following expectations: * Symmetry: If x < y, then y > x. * Transitivity: If x < y and y < z, then x < z. These violations could lead to incorrect sorting and failure to correctly identify duplicate elements. Fix the issue by directly returning the result of memcmp(), which adheres to the required comparison properties. Cc: stable@vger.kernel.org Fixes: 08d43a5fa063 ("tracing: Add lock-free tracing_map") Link: https://lore.kernel.org/20241203202228.1274403-1-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-17 13:24:28 +01:00
Hou Tao	6260fcd754	bpf: Fix exact match conditions in trie_get_next_key() [ Upstream commit 27abc7b3fa2e09bbe41e2924d328121546865eda ] trie_get_next_key() uses node->prefixlen == key->prefixlen to identify an exact match, However, it is incorrect because when the target key doesn't fully match the found node (e.g., node->prefixlen != matchlen), these two nodes may also have the same prefixlen. It will return expected result when the passed key exist in the trie. However when a recently-deleted key or nonexistent key is passed to trie_get_next_key(), it may skip keys and return incorrect result. Fix it by using node->prefixlen == matchlen to identify exact matches. When the condition is true after the search, it also implies node->prefixlen equals key->prefixlen, otherwise, the search would return NULL instead. Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:28 +01:00
Hou Tao	50e0b53ff1	bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie [ Upstream commit eae6a075e9537dd69891cf77ca5a88fa8a28b4a1 ] Add the currently missing handling for the BPF_EXIST and BPF_NOEXIST flags. These flags can be specified by users and are relevant since LPM trie supports exact matches during update. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:28 +01:00
Levi Yun	c20eee8bbe	trace/trace_event_perf: remove duplicate samples on the first tracepoint event [ Upstream commit afe5960dc208fe069ddaaeb0994d857b24ac19d1 ] When a tracepoint event is created with attr.freq = 1, 'hwc->period_left' is not initialized correctly. As a result, in the perf_swevent_overflow() function, when the first time the event occurs, it calculates the event overflow and the perf_swevent_set_period() returns 3, this leads to the event are recorded for three duplicate times. Step to reproduce: 1. Enable the tracepoint event & starting tracing $ echo 1 > /sys/kernel/tracing/events/module/module_free $ echo 1 > /sys/kernel/tracing/tracing_on 2. Record with perf $ perf record -a --strict-freq -F 1 -e "module:module_free" 3. Trigger module_free event. $ modprobe -i sunrpc $ modprobe -r sunrpc Result: - Trace pipe result: $ cat trace_pipe modprobe-174509 [003] ..... 6504.868896: module_free: sunrpc - perf sample: modprobe 174509 [003] 6504.868980: module:module_free: sunrpc modprobe 174509 [003] 6504.868980: module:module_free: sunrpc modprobe 174509 [003] 6504.868980: module:module_free: sunrpc By setting period_left via perf_swevent_set_period() as other sw_event did, This problem could be solved. After patch: - Trace pipe result: $ cat trace_pipe modprobe 1153096 [068] 613468.867774: module:module_free: xfs - perf sample modprobe 1153096 [068] 613468.867794: module:module_free: xfs Link: https://lore.kernel.org/20240913021347.595330-1-yeoreum.yun@arm.com Fixes: bd2b5b12849a ("perf_counter: More aggressive frequency adjustment") Signed-off-by: Levi Yun <yeoreum.yun@arm.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:08 +01:00
Chen Ridong	5c074418aa	cgroup/bpf: only cgroup v2 can be attached by bpf programs [ Upstream commit 2190df6c91373fdec6db9fc07e427084f232f57e ] Only cgroup v2 can be attached by bpf programs, so this patch introduces that cgroup_bpf_inherit and cgroup_bpf_offline can only be called in cgroup v2, and this can fix the memleak mentioned by commit 04f8ef5643bc ("cgroup: Fix memory leak caused by missing cgroup_bpf_offline"), which has been reverted. Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path") Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/ Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:02 +01:00
Chen Ridong	f70b602307	Revert "cgroup: Fix memory leak caused by missing cgroup_bpf_offline" [ Upstream commit feb301c60970bd2a1310a53ce2d6e4375397a51b ] This reverts commit 04f8ef5643bcd8bcde25dfdebef998aea480b2ba. Only cgroup v2 can be attached by cgroup by BPF programs. Revert this commit and cgroup_bpf_inherit and cgroup_bpf_offline won't be called in cgroup v1. The memory leak issue will be fixed with next patch. Fixes: 04f8ef5643bc ("cgroup: Fix memory leak caused by missing cgroup_bpf_offline") Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/ Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:02 +01:00
Miguel Ojeda	1d4d4e5c76	time: Fix references to _msecs_to_jiffies() handling of values [ Upstream commit 92b043fd995a63a57aae29ff85a39b6f30cd440c ] The details about the handling of the "normal" values were moved to the _msecs_to_jiffies() helpers in commit ca42aaf0c861 ("time: Refactor msecs_to_jiffies"). However, the same commit still mentioned __msecs_to_jiffies() in the added documentation. Thus point to _msecs_to_jiffies() instead. Fixes: ca42aaf0c861 ("time: Refactor msecs_to_jiffies") Signed-off-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241025110141.157205-2-ojeda@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:24:00 +01:00
Paul E. McKenney	3f7db10a69	rcu-tasks: Idle tasks on offline CPUs are in quiescent states commit 5c9a9ca44fda41c5e82f50efced5297a9c19760d upstream. Any idle task corresponding to an offline CPU is in an RCU Tasks Trace quiescent state. This commit causes rcu_tasks_trace_postscan() to ignore idle tasks for offline CPUs, which it can do safely due to CPU-hotplug operations being disabled. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <kafai@fb.com> Cc: KP Singh <kpsingh@kernel.org> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-17 13:23:58 +01:00
guoweikang	be2d07b4f2	ftrace: Fix regression with module command in stack_trace_filter commit 45af52e7d3b8560f21d139b3759735eead8b1653 upstream. When executing the following command: # echo "write*:mod:ext3" > /sys/kernel/tracing/stack_trace_filter The current mod command causes a null pointer dereference. While commit 0f17976568b3f ("ftrace: Fix regression with module command in stack_trace_filter") has addressed part of the issue, it left a corner case unhandled, which still results in a kernel crash. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241120052750.275463-1-guoweikang.kernel@gmail.com Fixes: 04ec7bb642b77 ("tracing: Have the trace_array hold the list of registered func probes"); Signed-off-by: guoweikang <guoweikang.kernel@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-17 13:20:50 +01:00
Ksawlii	90bc687dab	Revert "make sysctl constants read-only" This reverts commit `1f63f26cd2`.	2024-12-03 19:57:26 +01:00
Rik van Riel	2a2e1902e3	bpf: use kvzmalloc to allocate BPF verifier environment [ Upstream commit 434247637c66e1be2bc71a9987d4c3f0d8672387 ] The kzmalloc call in bpf_check can fail when memory is very fragmented, which in turn can lead to an OOM kill. Use kvzmalloc to fall back to vmalloc when memory is too fragmented to allocate an order 3 sized bpf verifier environment. Admittedly this is not a very common case, and only happens on systems where memory has already been squeezed close to the limit, but this does not seem like much of a hot path, and it's a simple enough fix. Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://lore.kernel.org/r/20241008170735.16766766@imladris.surriel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-30 02:33:27 +01:00
Daniel Micay	1f63f26cd2	make sysctl constants read-only Most of this is extracted from the last publicly available version of the PaX patches where it's part of KERNEXEC as __read_only. It has been extended to a few more of these constants.	2024-11-30 02:16:12 +01:00
Ksawlii	9947944ca2	Revert "cgroup/cpuset: Prevent UAF in proc_cpuset_show()" This reverts commit `106a2662b1`.	2024-11-24 00:23:50 +01:00
Ksawlii	481b3fc579	Revert "dma-debug: avoid deadlock between dma debug vs printk and netconsole" This reverts commit `a4688d6248`.	2024-11-24 00:23:48 +01:00
Ksawlii	8a0aebb9d1	Revert "rcu-tasks: Fix show_rcu_tasks_trace_gp_kthread buffer overflow" This reverts commit `84d6dbf54c`.	2024-11-24 00:23:46 +01:00
Ksawlii	04dba1cb02	Revert "cgroup: Protect css->cgroup write under css_set_lock" This reverts commit `630897cdcb`.	2024-11-24 00:23:40 +01:00
Ksawlii	93a14d2549	Revert "smp: Add missing destroy_work_on_stack() call in smp_call_on_cpu()" This reverts commit `0b1e77f743`.	2024-11-24 00:23:40 +01:00
Ksawlii	d5a08b5325	Revert "uprobes: Use kzalloc to allocate xol area" This reverts commit `9a18ce1f12`.	2024-11-24 00:23:37 +01:00
Ksawlii	afebebc45a	Revert "perf/aux: Fix AUX buffer serialization" This reverts commit `6d56f2f8a3`.	2024-11-24 00:23:37 +01:00
Ksawlii	1dfae2e328	Revert "rtmutex: Drop rt_mutex::wait_lock before scheduling" This reverts commit `07dcd58fea`.	2024-11-24 00:23:36 +01:00
Ksawlii	7534e491aa	Revert "cgroup: Make operations on the cgroup root_list RCU safe" This reverts commit `6c357bd6a8`.	2024-11-24 00:23:32 +01:00
Ksawlii	92c7fd94e3	Revert "x86/ibt,ftrace: Search for __fentry__ location" This reverts commit `f6f1a8e333`.	2024-11-24 00:23:31 +01:00
Ksawlii	cf372342bd	Revert "ftrace: Fix possible use-after-free issue in ftrace_location()" This reverts commit `2c12c9f7ef`.	2024-11-24 00:23:31 +01:00
Ksawlii	88d2580c4c	Revert "padata: Honor the caller's alignment in case of chunk_size 0" This reverts commit `d937fc3fb1`.	2024-11-24 00:23:30 +01:00
Ksawlii	9629f5d149	Revert "kthread: add kthread_work tracepoints" This reverts commit `0944044e57`.	2024-11-24 00:23:23 +01:00
Ksawlii	e6affe90e3	Revert "kthread: fix task state in kthread worker if being frozen" This reverts commit `b1ce87a881`.	2024-11-24 00:23:23 +01:00
Ksawlii	fdff7f158f	Revert "bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bit" This reverts commit `1f10bbe850`.	2024-11-24 00:23:22 +01:00
Ksawlii	40967b338d	Reapply "bpf: Fix DEVMAP_HASH overflow check on 32-bit arches" This reverts commit `ccff2373d6`.	2024-11-24 00:23:17 +01:00
Ksawlii	d439656b70	Reapply "bpf: Eliminate rlimit-based memory accounting for devmap maps" This reverts commit `41a87832ee`.	2024-11-24 00:23:17 +01:00
Ksawlii	78f121c9d6	Revert "bpf: Fix DEVMAP_HASH overflow check on 32-bit arches" This reverts commit `736d05560d`.	2024-11-24 00:23:17 +01:00
Ksawlii	268d243925	Revert "padata: use integer wrap around to prevent deadlock on seq_nr overflow" This reverts commit `3b9a874cc0`.	2024-11-24 00:23:15 +01:00
Ksawlii	c848ca572d	Revert "lockdep: fix deadlock issue between lockdep and rcu" This reverts commit `6624949eca`.	2024-11-24 00:23:13 +01:00
Ksawlii	62994ec6e3	Revert "signal: Replace BUG_ON()s" This reverts commit `dd7f63056a`.	2024-11-24 00:23:07 +01:00
Ksawlii	734a82005c	Revert "rcuscale: Provide clear error when async specified without primitives" This reverts commit `6a6821675d`.	2024-11-24 00:23:07 +01:00
Ksawlii	06e1842e46	Revert "perf/core: Fix small negative period being ignored" This reverts commit `347040dc8f`.	2024-11-24 00:23:05 +01:00
Ksawlii	adff2b9a7c	Revert "uprobes: fix kernel info leak via "[uprobes]" vma" This reverts commit `6ec781ea39`.	2024-11-24 00:23:00 +01:00
Ksawlii	3447182bc8	Revert "tracing: Remove precision vsnprintf() check from print event" This reverts commit `67512a7336`.	2024-11-24 00:22:59 +01:00
Ksawlii	bc40c2aebd	Revert "tracing: Have saved_cmdlines arrays all in one allocation" This reverts commit `6154d7268a`.	2024-11-24 00:22:59 +01:00
Ksawlii	bef5e7428f	Revert "kallsyms: Make kallsyms_on_each_symbol generally available" This reverts commit `9dc3580302`.	2024-11-24 00:22:59 +01:00
Ksawlii	62030ec18e	Revert "kallsyms: Make module_kallsyms_on_each_symbol generally available" This reverts commit `7b296fe960`.	2024-11-24 00:22:59 +01:00
Ksawlii	5fca109408	Revert "tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols" This reverts commit `058ee93b4b`.	2024-11-24 00:22:59 +01:00
Ksawlii	93ad10a774	Revert "tracing/kprobes: Fix symbol counting logic by looking at modules as well" This reverts commit `ef8f154c7c`.	2024-11-24 00:22:59 +01:00
Ksawlii	5ec15cdd1e	Revert "bpf: Check percpu map value size first" This reverts commit `22c88ba3cc`.	2024-11-24 00:22:59 +01:00
Ksawlii	a8299c44cf	Revert "resource: fix region_intersects() vs add_memory_driver_managed()" This reverts commit `29fa0c8808`.	2024-11-24 00:22:55 +01:00
Byeonguk Jeong	6b0b24ccd0	bpf: Fix out-of-bounds write in trie_get_next_key() [ Upstream commit 13400ac8fb80c57c2bfb12ebd35ee121ce9b4d21 ] trie_get_next_key() allocates a node stack with size trie->max_prefixlen, while it writes (trie->max_prefixlen + 1) nodes to the stack when it has full paths from the root to leaves. For example, consider a trie with max_prefixlen is 8, and the nodes with key 0x00/0, 0x00/1, 0x00/2, ... 0x00/8 inserted. Subsequent calls to trie_get_next_key with _key with .prefixlen = 8 make 9 nodes be written on the node stack with size 8. Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map") Signed-off-by: Byeonguk Jeong <jungbu2855@gmail.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Tested-by: Hou Tao <houtao1@huawei.com> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/Zxx384ZfdlFYnz6J@localhost.localdomain Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:22:03 +01:00
Xiu Jianfeng	ed45b3440f	cgroup: Fix potential overflow issue when checking max_depth [ Upstream commit 3cc4e13bb1617f6a13e5e6882465984148743cf4 ] cgroup.max.depth is the maximum allowed descent depth below the current cgroup. If the actual descent depth is equal or larger, an attempt to create a new child cgroup will fail. However due to the cgroup->max_depth is of int type and having the default value INT_MAX, the condition 'level > cgroup->max_depth' will never be satisfied, and it will cause an overflow of the level after it reaches to INT_MAX. Fix it by starting the level from 0 and using '>=' instead. It's worth mentioning that this issue is unlikely to occur in reality, as it's impossible to have a depth of INT_MAX hierarchy, but should be be avoided logically. Fixes: 1a926e0bbab8 ("cgroup: implement hierarchy limits") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:22:02 +01:00
Jinjie Ruan	a2c183d885	posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime() [ Upstream commit 6e62807c7fbb3c758d233018caf94dfea9c65dbd ] If get_clock_desc() succeeds, it calls fget() for the clockid's fd, and get the clk->rwsem read lock, so the error path should release the lock to make the lock balance and fput the clockid's fd to make the refcount balance and release the fd related resource. However the below commit left the error path locked behind resulting in unbalanced locking. Check timespec64_valid_strict() before get_clock_desc() to fix it, because the "ts" is not changed after that. Fixes: d8794ac20a29 ("posix-clock: Fix missing timespec64 check in pc_clock_settime()") Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Anna-Maria Behnsen <anna-maria@linutronix.de> [pabeni@redhat.com: fixed commit message typo] Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:22:01 +01:00
Leo Yan	2c7cc1a836	tracing: Consider the NULL character when validating the event length [ Upstream commit 0b6e2e22cb23105fcb171ab92f0f7516c69c8471 ] strlen() returns a string length excluding the null byte. If the string length equals to the maximum buffer length, the buffer will have no space for the NULL terminating character. This commit checks this condition and returns failure for it. Link: https://lore.kernel.org/all/20241007144724.920954-1-leo.yan@arm.com/ Fixes: dec65d79fd26 ("tracing/probe: Check event name length correctly") Signed-off-by: Leo Yan <leo.yan@arm.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:22:00 +01:00
Jinjie Ruan	211517137a	posix-clock: Fix missing timespec64 check in pc_clock_settime() commit d8794ac20a299b647ba9958f6d657051fc51a540 upstream. As Andrew pointed out, it will make sense that the PTP core checked timespec64 struct's tv_sec and tv_nsec range before calling ptp->info->settime64(). As the man manual of clock_settime() said, if tp.tv_sec is negative or tp.tv_nsec is outside the range [0..999,999,999], it should return EINVAL, which include dynamic clocks which handles PTP clock, and the condition is consistent with timespec64_valid(). As Thomas suggested, timespec64_valid() only check the timespec is valid, but not ensure that the time is in a valid range, so check it ahead using timespec64_valid_strict() in pc_clock_settime() and return -EINVAL if not valid. There are some drivers that use tp->tv_sec and tp->tv_nsec directly to write registers without validity checks and assume that the higher layer has checked it, which is dangerous and will benefit from this, such as hclge_ptp_settime(), igb_ptp_settime_i210(), _rcar_gen4_ptp_settime(), and some drivers can remove the checks of itself. Cc: stable@vger.kernel.org Fixes: 0606f422b453 ("posix clocks: Introduce dynamic clocks") Acked-by: Richard Cochran <richardcochran@gmail.com> Suggested-by: Andrew Lunn <andrew@lunn.ch> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Link: https://patch.msgid.link/20241009072302.1754567-2-ruanjinjie@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-23 23:21:54 +01:00
Huang Ying	29fa0c8808	resource: fix region_intersects() vs add_memory_driver_managed() commit b4afe4183ec77f230851ea139d91e5cf2644c68b upstream. On a system with CXL memory, the resource tree (/proc/iomem) related to CXL memory may look like something as follows. 490000000-50fffffff : CXL Window 0 490000000-50fffffff : region0 490000000-50fffffff : dax0.0 490000000-50fffffff : System RAM (kmem) Because drivers/dax/kmem.c calls add_memory_driver_managed() during onlining CXL memory, which makes "System RAM (kmem)" a descendant of "CXL Window X". This confuses region_intersects(), which expects all "System RAM" resources to be at the top level of iomem_resource. This can lead to bugs. For example, when the following command line is executed to write some memory in CXL memory range via /dev/mem, $ dd if=data of=/dev/mem bs=$((1 << 10)) seek=$((0x490000000 >> 10)) count=1 dd: error writing '/dev/mem': Bad address 1+0 records in 0+0 records out 0 bytes copied, 0.0283507 s, 0.0 kB/s the command fails as expected. However, the error code is wrong. It should be "Operation not permitted" instead of "Bad address". More seriously, the /dev/mem permission checking in devmem_is_allowed() passes incorrectly. Although the accessing is prevented later because ioremap() isn't allowed to map system RAM, it is a potential security issue. During command executing, the following warning is reported in the kernel log for calling ioremap() on system RAM. ioremap on RAM at 0x0000000490000000 - 0x0000000490000fff WARNING: CPU: 2 PID: 416 at arch/x86/mm/ioremap.c:216 __ioremap_caller.constprop.0+0x131/0x35d Call Trace: memremap+0xcb/0x184 xlate_dev_mem_ptr+0x25/0x2f write_mem+0x94/0xfb vfs_write+0x128/0x26d ksys_write+0xac/0xfe do_syscall_64+0x9a/0xfd entry_SYSCALL_64_after_hwframe+0x4b/0x53 The details of command execution process are as follows. In the above resource tree, "System RAM" is a descendant of "CXL Window 0" instead of a top level resource. So, region_intersects() will report no System RAM resources in the CXL memory region incorrectly, because it only checks the top level resources. Consequently, devmem_is_allowed() will return 1 (allow access via /dev/mem) for CXL memory region incorrectly. Fortunately, ioremap() doesn't allow to map System RAM and reject the access. So, region_intersects() needs to be fixed to work correctly with the resource tree with "System RAM" not at top level as above. To fix it, if we found a unmatched resource in the top level, we will continue to search matched resources in its descendant resources. So, we will not miss any matched resources in resource tree anymore. In the new implementation, an example resource tree \|------------- "CXL Window 0" ------------\| \|-- "System RAM" --\| will behave similar as the following fake resource tree for region_intersects(, IORESOURCE_SYSTEM_RAM, ), \|-- "System RAM" --\|\|-- "CXL Window 0a" --\| Where "CXL Window 0a" is part of the original "CXL Window 0" that isn't covered by "System RAM". Link: https://lkml.kernel.org/r/20240906030713.204292-2-ying.huang@intel.com Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:52 +01:00
Tao Chen	22c88ba3cc	bpf: Check percpu map value size first [ Upstream commit 1d244784be6b01162b732a5a7d637dfc024c3203 ] Percpu map is often used, but the map value size limit often ignored, like issue: https://github.com/iovisor/bcc/issues/2519. Actually, percpu map value size is bound by PCPU_MIN_UNIT_SIZE, so we can check the value size whether it exceeds PCPU_MIN_UNIT_SIZE first, like percpu map of local_storage. Maybe the error message seems clearer compared with "cannot allocate memory". Signed-off-by: Jinke Han <jinkehan@didiglobal.com> Signed-off-by: Tao Chen <chen.dylane@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240910144111.1464912-2-chen.dylane@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:49 +01:00
Andrii Nakryiko	ef8f154c7c	tracing/kprobes: Fix symbol counting logic by looking at modules as well commit 926fe783c8a64b33997fec405cf1af3e61aed441 upstream. Recent changes to count number of matching symbols when creating a kprobe event failed to take into account kernel modules. As such, it breaks kprobes on kernel module symbols, by assuming there is no match. Fix this my calling module_kallsyms_on_each_symbol() in addition to kallsyms_on_each_match_symbol() to perform a proper counting. Link: https://lore.kernel.org/all/20231027233126.2073148-1-andrii@kernel.org/ Cc: Francis Laniel <flaniel@linux.microsoft.com> Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Markus Boehme <markubo@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [Sherry: It's a fix for previous backport, thus backport together] Signed-off-by: Sherry Yang <sherry.yang@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:49 +01:00
Francis Laniel	058ee93b4b	tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols commit b022f0c7e404887a7c5229788fc99eff9f9a80d5 upstream. When a kprobe is attached to a function that's name is not unique (is static and shares the name with other functions in the kernel), the kprobe is attached to the first function it finds. This is a bug as the function that it is attaching to is not necessarily the one that the user wants to attach to. Instead of blindly picking a function to attach to what is ambiguous, error with EADDRNOTAVAIL to let the user know that this function is not unique, and that the user must use another unique function with an address offset to get to the function they want to attach to. Link: https://lore.kernel.org/all/20231020104250.9537-2-flaniel@linux.microsoft.com/ Cc: stable@vger.kernel.org Fixes: 413d37d1eb69 ("tracing: Add kprobe-based event tracer") Suggested-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com> Link: https://lore.kernel.org/lkml/20230819101105.b0c104ae4494a7d1f2eea742@kernel.org/ Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [Sherry: 5.10.y added a new kselftest kprobe_non_uniq_symbol.tc by backporting commit 09bcf9254838 ("selftests/ftrace: Add new test case which checks non unique symbol"). However, 5.10.y didn't backport this commit which provides unique symbol check suppport from kernel side. Minor conflicts due to context change, ignore context change] Signed-off-by: Sherry Yang <sherry.yang@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:49 +01:00
Jiri Olsa	7b296fe960	kallsyms: Make module_kallsyms_on_each_symbol generally available commit 73feb8d5fa3b755bb51077c0aabfb6aa556fd498 upstream. Making module_kallsyms_on_each_symbol generally available, so it can be used outside CONFIG_LIVEPATCH option in following changes. Rather than adding another ifdef option let's make the function generally available (when CONFIG_KALLSYMS and CONFIG_MODULES options are defined). Cc: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20221025134148.3300700-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Stable-dep-of: 926fe783c8a6 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well") Signed-off-by: Markus Boehme <markubo@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Stable-dep-of: 329197033bb0 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well") Signed-off-by: Sherry Yang <sherry.yang@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:49 +01:00
Jiri Olsa	9dc3580302	kallsyms: Make kallsyms_on_each_symbol generally available [ Upstream commit d721def7392a7348ffb9f3583b264239cbd3702c ] Making kallsyms_on_each_symbol generally available, so it can be used outside CONFIG_LIVEPATCH option in following changes. Rather than adding another ifdef option let's make the function generally available (when CONFIG_KALLSYMS option is defined). Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20220510122616.2652285-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Stable-dep-of: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols") Signed-off-by: Sherry Yang <sherry.yang@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:49 +01:00
Steven Rostedt (Google)	6154d7268a	tracing: Have saved_cmdlines arrays all in one allocation [ Upstream commit 0b18c852cc6fb8284ac0ab97e3e840974a6a8a64 ] The saved_cmdlines have three arrays for mapping PIDs to COMMs: - map_pid_to_cmdline[] - map_cmdline_to_pid[] - saved_cmdlines The map_pid_to_cmdline[] is PID_MAX_DEFAULT in size and holds the index into the other arrays. The map_cmdline_to_pid[] is a mapping back to the full pid as it can be larger than PID_MAX_DEFAULT. And the saved_cmdlines[] just holds the COMMs associated to the pids. Currently the map_pid_to_cmdline[] and saved_cmdlines[] are allocated together (in reality the saved_cmdlines is just in the memory of the rounding of the allocation of the structure as it is always allocated in powers of two). The map_cmdline_to_pid[] array is allocated separately. Since the rounding to a power of two is rather large (it allows for 8000 elements in saved_cmdlines), also include the map_cmdline_to_pid[] array. (This drops it to 6000 by default, which is still plenty for most use cases). This saves even more memory as the map_cmdline_to_pid[] array doesn't need to be allocated. Link: https://lore.kernel.org/linux-trace-kernel/20240212174011.068211d9@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240220140703.182330529@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Fixes: 44dc5c41b5b1 ("tracing: Fix wasted memory in saved_cmdlines logic") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:48 +01:00
Steven Rostedt (Google)	67512a7336	tracing: Remove precision vsnprintf() check from print event [ Upstream commit 5efd3e2aef91d2d812290dcb25b2058e6f3f532c ] This reverts 60be76eeabb3d ("tracing: Add size check when printing trace_marker output"). The only reason the precision check was added was because of a bug that miscalculated the write size of the string into the ring buffer and it truncated it removing the terminating nul byte. On reading the trace it crashed the kernel. But this was due to the bug in the code that happened during development and should never happen in practice. If anything, the precision can hide bugs where the string in the ring buffer isn't nul terminated and it will not be checked. Link: https://lore.kernel.org/all/C7E7AF1A-D30F-4D18-B8E5-AF1EF58004F5@linux.ibm.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240227125706.04279ac2@gandalf.local.home Link: https://lore.kernel.org/all/20240302111244.3a1674be@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240304174341.2a561d9f@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Fixes: 60be76eeabb3d ("tracing: Add size check when printing trace_marker output") Reported-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:48 +01:00
Oleg Nesterov	6ec781ea39	uprobes: fix kernel info leak via "[uprobes]" vma commit 34820304cc2cd1804ee1f8f3504ec77813d29c8e upstream. xol_add_vma() maps the uninitialized page allocated by __create_xol_area() into userspace. On some architectures (x86) this memory is readable even without VM_READ, VM_EXEC results in the same pgprot_t as VM_EXEC\|VM_READ, although this doesn't really matter, debugger can read this memory anyway. Link: https://lore.kernel.org/all/20240929162047.GA12611@redhat.com/ Reported-by: Will Deacon <will@kernel.org> Fixes: d4b3b6384f98 ("uprobes/core: Allocate XOL slots for uprobes use") Cc: stable@vger.kernel.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:47 +01:00
Luo Gengkun	347040dc8f	perf/core: Fix small negative period being ignored commit 62c0b1061593d7012292f781f11145b2d46f43ab upstream. In perf_adjust_period, we will first calculate period, and then use this period to calculate delta. However, when delta is less than 0, there will be a deviation compared to when delta is greater than or equal to 0. For example, when delta is in the range of [-14,-1], the range of delta = delta + 7 is between [-7,6], so the final value of delta/8 is 0. Therefore, the impact of -1 and -2 will be ignored. This is unacceptable when the target period is very short, because we will lose a lot of samples. Here are some tests and analyzes: before: # perf record -e cs -F 1000 ./a.out [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.022 MB perf.data (518 samples) ] # perf script ... a.out 396 257.956048: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.957891: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.959730: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.961545: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.963355: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.965163: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.966973: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.968785: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.970593: 23 cs: ffffffff81f4eeec schedul> ... after: # perf record -e cs -F 1000 ./a.out [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.058 MB perf.data (1466 samples) ] # perf script ... a.out 395 59.338813: 11 cs: ffffffff81f4eeec schedul> a.out 395 59.339707: 12 cs: ffffffff81f4eeec schedul> a.out 395 59.340682: 13 cs: ffffffff81f4eeec schedul> a.out 395 59.341751: 13 cs: ffffffff81f4eeec schedul> a.out 395 59.342799: 12 cs: ffffffff81f4eeec schedul> a.out 395 59.343765: 11 cs: ffffffff81f4eeec schedul> a.out 395 59.344651: 11 cs: ffffffff81f4eeec schedul> a.out 395 59.345539: 12 cs: ffffffff81f4eeec schedul> a.out 395 59.346502: 13 cs: ffffffff81f4eeec schedul> ... test.c int main() { for (int i = 0; i < 20000; i++) usleep(10); return 0; } # time ./a.out real 0m1.583s user 0m0.040s sys 0m0.298s The above results were tested on x86-64 qemu with KVM enabled using test.c as test program. Ideally, we should have around 1500 samples, but the previous algorithm had only about 500, whereas the modified algorithm now has about 1400. Further more, the new version shows 1 sample per 0.001s, while the previous one is 1 sample per 0.002s.This indicates that the new algorithm is more sensitive to small negative values compared to old algorithm. Fixes: bd2b5b12849a ("perf_counter: More aggressive frequency adjustment") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20240831074316.2106159-2-luogengkun@huaweicloud.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-23 23:21:43 +01:00
Paul E. McKenney	6a6821675d	rcuscale: Provide clear error when async specified without primitives [ Upstream commit 11377947b5861fa59bf77c827e1dd7c081842cc9 ] Currently, if the rcuscale module's async module parameter is specified for RCU implementations that do not have async primitives such as RCU Tasks Rude (which now lacks a call_rcu_tasks_rude() function), there will be a series of splats due to calls to a NULL pointer. This commit therefore warns of this situation, but switches to non-async testing. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:39 +01:00
Thomas Gleixner	dd7f63056a	signal: Replace BUG_ON()s [ Upstream commit 7f8af7bac5380f2d95a63a6f19964e22437166e1 ] These really can be handled gracefully without killing the machine. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:39 +01:00
Zhiguo Niu	6624949eca	lockdep: fix deadlock issue between lockdep and rcu commit a6f88ac32c6e63e69c595bfae220d8641704c9b7 upstream. There is a deadlock scenario between lockdep and rcu when rcu nocb feature is enabled, just as following call stack: rcuop/x -000\|queued_spin_lock_slowpath(lock = 0xFFFFFF817F2A8A80, val = ?) -001\|queued_spin_lock(inline) // try to hold nocb_gp_lock -001\|do_raw_spin_lock(lock = 0xFFFFFF817F2A8A80) -002\|__raw_spin_lock_irqsave(inline) -002\|_raw_spin_lock_irqsave(lock = 0xFFFFFF817F2A8A80) -003\|wake_nocb_gp_defer(inline) -003\|__call_rcu_nocb_wake(rdp = 0xFFFFFF817F30B680) -004\|__call_rcu_common(inline) -004\|call_rcu(head = 0xFFFFFFC082EECC28, func = ?) -005\|call_rcu_zapped(inline) -005\|free_zapped_rcu(ch = ?)// hold graph lock -006\|rcu_do_batch(rdp = 0xFFFFFF817F245680) -007\|nocb_cb_wait(inline) -007\|rcu_nocb_cb_kthread(arg = 0xFFFFFF817F245680) -008\|kthread(_create = 0xFFFFFF80803122C0) -009\|ret_from_fork(asm) rcuop/y -000\|queued_spin_lock_slowpath(lock = 0xFFFFFFC08291BBC8, val = 0) -001\|queued_spin_lock() -001\|lockdep_lock() -001\|graph_lock() // try to hold graph lock -002\|lookup_chain_cache_add() -002\|validate_chain() -003\|lock_acquire -004\|_raw_spin_lock_irqsave(lock = 0xFFFFFF817F211D80) -005\|lock_timer_base(inline) -006\|mod_timer(inline) -006\|wake_nocb_gp_defer(inline)// hold nocb_gp_lock -006\|__call_rcu_nocb_wake(rdp = 0xFFFFFF817F2A8680) -007\|__call_rcu_common(inline) -007\|call_rcu(head = 0xFFFFFFC0822E0B58, func = ?) -008\|call_rcu_hurry(inline) -008\|rcu_sync_call(inline) -008\|rcu_sync_func(rhp = 0xFFFFFFC0822E0B58) -009\|rcu_do_batch(rdp = 0xFFFFFF817F266680) -010\|nocb_cb_wait(inline) -010\|rcu_nocb_cb_kthread(arg = 0xFFFFFF817F266680) -011\|kthread(_create = 0xFFFFFF8080363740) -012\|ret_from_fork(asm) rcuop/x and rcuop/y are rcu nocb threads with the same nocb gp thread. This patch release the graph lock before lockdep call_rcu. Fixes: a0b0fd53e1e6 ("locking/lockdep: Free lock classes that are no longer in use") Cc: stable@vger.kernel.org Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Waiman Long <longman@redhat.com> Cc: Carlos Llamas <cmllamas@google.com> Cc: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Carlos Llamas <cmllamas@google.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Link: https://lore.kernel.org/r/20240620225436.3127927-1-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-23 23:21:34 +01:00
VanGiang Nguyen	3b9a874cc0	padata: use integer wrap around to prevent deadlock on seq_nr overflow commit 9a22b2812393d93d84358a760c347c21939029a6 upstream. When submitting more than 2^32 padata objects to padata_do_serial, the current sorting implementation incorrectly sorts padata objects with overflowed seq_nr, causing them to be placed before existing objects in the reorder list. This leads to a deadlock in the serialization process as padata_find_next cannot match padata->seq_nr and pd->processed because the padata instance with overflowed seq_nr will be selected next. To fix this, we use an unsigned integer wrap around to correctly sort padata objects in scenarios with integer overflow. Fixes: bfde23ce200e ("padata: unbind parallel jobs from specific CPUs") Cc: <stable@vger.kernel.org> Co-developed-by: Christian Gafert <christian.gafert@rohde-schwarz.com> Signed-off-by: Christian Gafert <christian.gafert@rohde-schwarz.com> Co-developed-by: Max Ferger <max.ferger@rohde-schwarz.com> Signed-off-by: Max Ferger <max.ferger@rohde-schwarz.com> Signed-off-by: Van Giang Nguyen <vangiang.nguyen@rohde-schwarz.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-23 23:21:33 +01:00
Toke Høiland-Jørgensen	736d05560d	bpf: Fix DEVMAP_HASH overflow check on 32-bit arches [ Upstream commit 281d464a34f540de166cee74b723e97ac2515ec3 ] The devmap code allocates a number hash buckets equal to the next power of two of the max_entries value provided when creating the map. When rounding up to the next power of two, the 32-bit variable storing the number of buckets can overflow, and the code checks for overflow by checking if the truncated 32-bit value is equal to 0. However, on 32-bit arches the rounding up itself can overflow mid-way through, because it ends up doing a left-shift of 32 bits on an unsigned long value. If the size of an unsigned long is four bytes, this is undefined behaviour, so there is no guarantee that we'll end up with a nice and tidy 0-value at the end. Syzbot managed to turn this into a crash on arm32 by creating a DEVMAP_HASH with max_entries > 0x80000000 and then trying to update it. Fix this by moving the overflow check to before the rounding up operation. Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index") Link: https://lore.kernel.org/r/000000000000ed666a0611af6818@google.com Reported-and-tested-by: syzbot+8cd36f6b65f3cafd400a@syzkaller.appspotmail.com Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Message-ID: <20240307120340.99577-2-toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-23 23:21:29 +01:00
Pu Lehui	41a87832ee	Revert "bpf: Eliminate rlimit-based memory accounting for devmap maps" This reverts commit 70294d8bc31f3b7789e5e32f757aa9344556d964 which is commit 844f157f6c0a905d039d2e20212ab3231f2e5eaf upstream. Commit 70294d8bc31f ("bpf: Eliminate rlimit-based memory accounting for devmap maps") is part of the v5.11+ base mechanism of memcg-based memory accounting[0]. The commit cannot be independently backported to the 5.10 stable branch, otherwise the related memory when creating devmap will be unrestricted. Let's roll back to rlimit-based memory accounting mode for devmap. Link: https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com [0] Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-23 23:21:29 +01:00

1 2 3 4 5 ...

423 commits