Commit graph

44 commits

Author SHA1 Message Date
Tyler Nijmeh
e983caf81d sched: core: Minimize number of tasks to load balance
We don't want sched_nr_migrate to be too high, as that would hurt
real-time latencies: the load balancing pass for SCHED_OTHER runs
with IRQs disabled.

Instead of making this a constant value, let's use an amount that
relates to the CPU performance of a device.

Consider a device with a weak 4-core processor that needs to load
balance 32 tasks from the busiest CPU. That balancing would take
significantly longer than on a device with 8 cores.

Conversely, that same 8-core device balancing 32 tasks would have no
issue with SCHED_OTHER performance; real-time tasks, however, would
suffer more jitter than necessary.

Let's only balance as many tasks as there are CPUs in a device for
optimal SCHED_OTHER performance and SCHED_FIFO / SCHED_RR latency.
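
Illustrative sketch (not the verbatim patch) of one way to express "balance at most one task per CPU"; the initcall name and placement are assumptions:

```c
/* Sketch: scale the load-balance batch size with the CPU count. */
static int __init sched_scale_nr_migrate(void)
{
	/* Assumes sysctl_sched_nr_migrate is writable in this config. */
	sysctl_sched_nr_migrate = num_possible_cpus();
	return 0;
}
core_initcall(sched_scale_nr_migrate);
```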
2024-12-17 20:32:22 +01:00
Nahuel Gómez
50e7a3b302 fs,kernel,mm: tune to Ktweak balance
Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-12-17 20:30:06 +01:00
Nahuel Gómez
1811b5b3e5 sched/fair: apply init protection
Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-12-17 20:28:06 +01:00
Park Ju Hyung
26944181d5 sysctl: promote several nodes out of CONFIG_SCHED_DEBUG
These are used in Android.
Promote them out of CONFIG_SCHED_DEBUG so that CONFIG_SCHED_DEBUG can be
disabled.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
[0ctobot: Adapted for 4.19]
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Change-Id: I8053176882e155926769939de15da375e7d548a0
2024-12-17 20:27:04 +01:00
K Prateek Nayak
e6fdbec5ef sched/core: Prevent wakeup of ksoftirqd during idle load balance
[ Upstream commit e932c4ab38f072ce5894b2851fea8bc5754bb8e5 ]

Scheduler raises a SCHED_SOFTIRQ to trigger a load balancing event from
the IPI handler on the idle CPU. If the SMP function is invoked from an
idle CPU via flush_smp_call_function_queue() then the HARD-IRQ flag is
not set and raise_softirq_irqoff() needlessly wakes ksoftirqd because
soft interrupts are handled before ksoftirqd gets on the CPU.

Adding a trace_printk() in nohz_csd_func() at the spot of raising
SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup,
and softirq_entry (for SCHED_SOFTIRQ vector alone) helps observing the
current behavior:

       <idle>-0   [000] dN.1.:  nohz_csd_func: Raising SCHED_SOFTIRQ from nohz_csd_func
       <idle>-0   [000] dN.4.:  sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000
       <idle>-0   [000] .Ns1.:  softirq_entry: vec=7 [action=SCHED]
       <idle>-0   [000] .Ns1.:  softirq_exit: vec=7  [action=SCHED]
       <idle>-0   [000] d..2.:  sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=16 next_prio=120
  ksoftirqd/0-16  [000] d..2.:  sched_switch: prev_comm=ksoftirqd/0 prev_pid=16 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
       ...

Use __raise_softirq_irqoff() to raise the softirq. The SMP function call
is always invoked on the requested CPU in an interrupt handler. It is
guaranteed that soft interrupts are handled at the end.

Following are the observations with the changes when enabling the same
set of events:

       <idle>-0       [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ for nohz_idle_balance
       <idle>-0       [000] dN.1.: softirq_raise: vec=7 [action=SCHED]
       <idle>-0       [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]

No unnecessary ksoftirqd wakeups are seen from idle task's context to
service the softirq.
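
A simplified sketch of the described change in nohz_csd_func() (not the verbatim patch; surrounding details trimmed):

```c
static void nohz_csd_func(void *info)
{
	struct rq *rq = info;
	int cpu = cpu_of(rq);
	unsigned int flags;

	/* Release rq::nohz_csd (simplified). */
	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));

	rq->idle_balance = idle_cpu(cpu);
	if (rq->idle_balance) {
		rq->nohz_idle_balance = flags;
		/* Was raise_softirq_irqoff(); no ksoftirqd wakeup needed here. */
		__raise_softirq_irqoff(SCHED_SOFTIRQ);
	}
}
```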

Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Closes: https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/ [1]
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20241119054432.6405-5-kprateek.nayak@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-17 13:24:33 +01:00
K Prateek Nayak
d4d43810f2 sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy
[ Upstream commit ff47a0acfcce309cf9e175149c75614491953c8f ]

Commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
optimizes IPIs to idle CPUs in TIF_POLLING_NRFLAG mode by setting the
TIF_NEED_RESCHED flag in idle task's thread info and relying on
flush_smp_call_function_queue() in idle exit path to run the
call-function. A softirq raised by the call-function is handled shortly
after in do_softirq_post_smp_call_flush() but the TIF_NEED_RESCHED flag
remains set and is only cleared later when schedule_idle() calls
__schedule().

need_resched() check in _nohz_idle_balance() exists to bail out of load
balancing if another task has woken up on the CPU currently in-charge of
idle load balancing which is being processed in SCHED_SOFTIRQ context.
Since the optimization mentioned above overloads the interpretation of
TIF_NEED_RESCHED, check for idle_cpu() before going with the existing
need_resched() check which can catch a genuine task wakeup on an idle
CPU processing SCHED_SOFTIRQ from do_softirq_post_smp_call_flush(), as
well as the case where ksoftirqd needs to be preempted as a result of
new task wakeup or slice expiry.

In case of PREEMPT_RT or threadirqs, although the idle load balancing
may be inhibited in some cases on the ilb CPU, the fact that ksoftirqd
is the only fair task going back to sleep will trigger a newidle balance
on the CPU, which will alleviate any remaining imbalance if the idle
balance fails to do so.
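
A sketch of the kind of bail-out check described above, inside _nohz_idle_balance() (simplified; surrounding code assumed):

```c
		/*
		 * Only treat need_resched() as "this CPU turned busy" when it
		 * is genuinely no longer idle; TIF_NEED_RESCHED alone may just
		 * be the IPI optimization from b2a02fc43a1f.
		 */
		if (!idle_cpu(this_cpu) && need_resched()) {
			if (flags & NOHZ_STATS_KICK)
				has_blocked_load = true;
			goto abort;
		}
```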

Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-4-kprateek.nayak@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-17 13:24:33 +01:00
Valentin Schneider
8a8ef40c42 sched/fair: Add NOHZ balancer flag for nohz.next_balance updates
[ Upstream commit efd984c481abb516fab8bafb25bf41fd9397a43c ]

A following patch will trigger NOHZ idle balances as a means to update
nohz.next_balance. Vincent noted that blocked load updates can have
non-negligible overhead, which should be avoided if the intent is to only
update nohz.next_balance.

Add a new NOHZ balance kick flag, NOHZ_NEXT_KICK. Gate NOHZ blocked load
update by the presence of NOHZ_STATS_KICK - currently all NOHZ balance
kicks will have the NOHZ_STATS_KICK flag set, so no change in behaviour is
expected.
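
A minimal sketch of the new flag next to the existing kick flags (simplified; exact bit positions assumed):

```c
#define NOHZ_BALANCE_KICK	BIT(0)	/* run a full idle load balance  */
#define NOHZ_STATS_KICK		BIT(1)	/* update blocked load           */
#define NOHZ_NEXT_KICK		BIT(2)	/* only update nohz.next_balance */

#define NOHZ_KICK_MASK	(NOHZ_BALANCE_KICK | NOHZ_STATS_KICK | NOHZ_NEXT_KICK)
```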

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210823111700.2842997-2-valentin.schneider@arm.com
Stable-dep-of: ff47a0acfcce ("sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-17 13:24:33 +01:00
Vincent Guittot
ab620a407a sched/fair: Trigger the update of blocked load on newly idle cpu
[ Upstream commit c6f886546cb8a38617cdbe755fe50d3acd2463e4 ]

Instead of waking up a random and already idle CPU, we can take advantage
of this_cpu being about to enter idle to run the ILB and update the
blocked load.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224133007.28644-7-vincent.guittot@linaro.org
Stable-dep-of: ff47a0acfcce ("sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-17 13:24:33 +01:00
Vincent Guittot
9ae9714a14 sched/fair: Merge for each idle cpu loop of ILB
[ Upstream commit 7a82e5f52a3506bc35a4dc04d53ad2c9daf82e7f ]

Remove the specific case for handling this_cpu outside the for_each_cpu()
loop when running the ILB. Instead we use for_each_cpu_wrap() and start
with the next cpu after this_cpu, so we still finish with this_cpu.

update_nohz_stats() is now used for this_cpu too and prevents unnecessary
updates. We no longer need a special case for handling the update of
nohz.next_balance for this_cpu because it is now handled by the loop like
the others.
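
A sketch of the merged loop shape (body trimmed):

```c
	/* Start just after this_cpu so the wrap-around ends with this_cpu. */
	for_each_cpu_wrap(balance_cpu, nohz.idle_cpus_mask, this_cpu + 1) {
		if (!idle_cpu(balance_cpu))
			continue;

		/* update_nohz_stats() / rebalance_domains() as before. */
	}
```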

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224133007.28644-5-vincent.guittot@linaro.org
Stable-dep-of: ff47a0acfcce ("sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-17 13:24:33 +01:00
Vincent Guittot
4ae526c326 sched/fair: Remove unused parameter of update_nohz_stats
[ Upstream commit 64f84f273592d17dcdca20244168ad9f525a39c3 ]

Idle load balance is the only user of update_nohz_stats() and doesn't use
the force parameter. Remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224133007.28644-4-vincent.guittot@linaro.org
Stable-dep-of: ff47a0acfcce ("sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-17 13:24:33 +01:00
Vincent Guittot
fe0cdb4e3f sched/fair: Remove update of blocked load from newidle_balance
[ Upstream commit 0826530de3cbdc89e60a89e86def94a5f0fc81ca ]

newidle_balance() runs with both preemption and IRQs disabled, which
prevents local IRQs from running during this period. The duration of the
blocked load update for the CPUs varies with the number of CPU cgroups
with non-decayed load and extends this critical section to an
uncontrolled level.

Remove the update from newidle_balance and trigger a normal ILB that
will take care of the update instead.

This reduces the IRQ latency from O(nr_cgroups * nr_nohz_cpus) to
O(nr_cgroups).

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224133007.28644-2-vincent.guittot@linaro.org
Stable-dep-of: ff47a0acfcce ("sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-17 13:24:33 +01:00
K Prateek Nayak
d08554baac sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
[ Upstream commit ea9cffc0a154124821531991d5afdd7e8b20d7aa ]

The need_resched() check currently in nohz_csd_func() can be traced back
to its addition in scheduler_ipi() in 2011 via commit
ca38062e57e9 ("sched: Use resched IPI to kick off the nohz idle balance")

Since then, it has travelled quite a bit, but it seems like an idle_cpu()
check is now sufficient to detect the need to bail out of an idle load
balance. To justify this removal, consider all the following cases where
an idle load balance could race with a task wakeup:

o Since commit f3dd3f674555b ("sched: Remove the limitation of WF_ON_CPU
  on wakelist if wakee cpu is idle") a target perceived to be idle
  (target_rq->nr_running == 0) will return true for
  ttwu_queue_cond(target) which will offload the task wakeup to the idle
  target via an IPI.

  In all such cases target_rq->ttwu_pending will be set to 1 before
  queuing the wake function.

  If an idle load balance races here, following scenarios are possible:

  - The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
    IPI is sent to the CPU to wake it out of idle. If the
    nohz_csd_func() queues before sched_ttwu_pending(), the idle load
    balance will bail out since idle_cpu(target) returns 0 since
    target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
    sched_ttwu_pending() it should see rq->nr_running to be non-zero and
    bail out of idle load balancing.

  - The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
    the sender will simply set TIF_NEED_RESCHED for the target to put it
    out of idle and flush_smp_call_function_queue() in do_idle() will
    execute the call function. Depending on the ordering of the queuing
    of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
    nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
    target_rq->nr_running to be non-zero if there is a genuine task
    wakeup racing with the idle load balance kick.

o The waker CPU perceives the target CPU to be busy
  (target_rq->nr_running != 0) but the CPU is in fact going idle and due
  to a series of unfortunate events, the system reaches a case where the
  waker CPU decides to perform the wakeup by itself in ttwu_queue() on
  the target CPU but target is concurrently selected for idle load
  balance (XXX: Can this happen? I'm not sure, but we'll consider the
  mother of all coincidences to estimate the worst case scenario).

  ttwu_do_activate() calls enqueue_task() which would increment
  "rq->nr_running" post which it calls wakeup_preempt() which is
  responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
  setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU). The key
  thing to note in this case is that rq->nr_running is already non-zero
  in case of a wakeup before TIF_NEED_RESCHED is set which would
  lead to idle_cpu() check returning false.

In all cases, it seems that the need_resched() check is unnecessary when
checking idle_cpu() first, since an impending wakeup racing with the idle
load balancer will either set "rq->ttwu_pending" or indicate a newly
woken task via "rq->nr_running".

Chasing the reason why this check might have existed in the first place,
I came across Peter's suggestion on the first iteration of Suresh's
patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was:

	sched_ttwu_do_pending(list);

	if (unlikely((rq->idle == current) &&
	    rq->nohz_balance_kick &&
	    !need_resched()))
		raise_softirq_irqoff(SCHED_SOFTIRQ);

Since the condition to raise the SCHED_SOFTIRQ was preceded by
sched_ttwu_do_pending() (the equivalent of sched_ttwu_pending() in
the current upstream kernel), the need_resched() check was necessary to
catch a newly queued task. Peter suggested modifying it to:

	if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
		raise_softirq_irqoff(SCHED_SOFTIRQ);

where idle_cpu() seems to have replaced "rq->idle == current" check.

Even back then, the idle_cpu() check would have been sufficient to catch
a new task being enqueued. Since commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") overloads the interpretation of
TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the
need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
on Peter's suggestion.
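
A sketch of the resulting condition in nohz_csd_func() (simplified, not the verbatim patch):

```c
	rq->idle_balance = idle_cpu(cpu);
	if (rq->idle_balance) {	/* was: idle_cpu(cpu) && !need_resched() */
		rq->nohz_idle_balance = flags;
		raise_softirq_irqoff(SCHED_SOFTIRQ);
	}
```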

Fixes: b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-3-kprateek.nayak@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-17 13:24:33 +01:00
Zheng Zucheng
185d961782 sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime
commit 77baa5bafcbe1b2a15ef9c37232c21279c95481c upstream.

In extreme test scenarios:
the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns

In the cputime_adjust() path, stime becomes greater than rtime due to a
mul_u64_u64_div_u64() precision problem.
Before calling mul_u64_u64_div_u64():
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
After calling mul_u64_u64_div_u64():
stime = 135989949653530

Unsigned underflow occurs because rtime is less than stime:
utime = rtime - stime = 135989749728000 - 135989949653530
		      = -199925530
		      = (u64)18446744073709518790

Trigger conditions:
  1) The user task runs in kernel mode most of the time
  2) ARM64 architecture
  3) TICK_CPU_ACCOUNTING=y and
     CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set

Fix the mul_u64_u64_div_u64() conversion precision by resetting stime to
rtime.
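
A sketch of the described clamp in cputime_adjust() (comment wording assumed):

```c
	stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
	/*
	 * mul_u64_u64_div_u64() can approximate on some architectures;
	 * enforce a*b/(b+c) <= a so that rtime - stime cannot underflow.
	 */
	if (unlikely(stime > rtime))
		stime = rtime;
```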

Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-23 23:20:24 +01:00
Pierre Gondois
da0a9d1d3a sched/fair: Use all little CPUs for CPU-bound workloads
commit 3af7524b14198f5159a86692d57a9f28ec9375ce upstream.

Running N CPU-bound tasks on an N CPUs platform:

- with asymmetric CPU capacity

- not being a DynamIq system (i.e. having a PKG level sched domain
  without the SD_SHARE_PKG_RESOURCES flag set)

... might result in a task placement where two tasks run on a big CPU
and none on a little CPU. The placement would be more optimal if all
CPUs were used.

Testing platform:

  Juno-r2:
    - 2 big CPUs (1-2), maximum capacity of 1024
    - 4 little CPUs (0,3-5), maximum capacity of 383

Testing workload ([1]):

  Spawn 6 CPU-bound tasks. During the first 100ms (step 1), each task
  is affined to a CPU, except for:

    - one little CPU, which is left idle.
    - one big CPU, which has 2 tasks affined to it.

  After the first 100ms (step 2), remove the cpumask affinity.

Behavior before the patch:

  During step 2, the load balancer running from the idle CPU tags sched
  domains as:

  - little CPUs: 'group_has_spare'. Cf. group_has_capacity() and
    group_is_overloaded(), 3 CPU-bound tasks run on a 4 CPUs
    sched-domain, and the idle CPU provides enough spare capacity
    regarding the imbalance_pct

  - big CPUs: 'group_overloaded'. Indeed, 3 tasks run on a 2 CPUs
    sched-domain, so the following path is used:

      group_is_overloaded()
      \-if (sgs->sum_nr_running <= sgs->group_weight) return true;

    The following path which would change the migration type to
    'migrate_task' is not taken:

      calculate_imbalance()
      \-if (env->idle != CPU_NOT_IDLE && env->imbalance == 0)

    as the local group has some spare capacity, so the imbalance
    is not 0.

  The migration type requested is 'migrate_util' and the busiest
  runqueue is the big CPU's runqueue having 2 tasks (each having a
  utilization of 512). The idle little CPU cannot pull one of these
  task as its capacity is too small for the task. The following path
  is used:

   detach_tasks()
   \-case migrate_util:
     \-if (util > env->imbalance) goto next;

After the patch:

As the number of failed balancing attempts grows (with
'nr_balance_failed'), progressively make it easier to migrate
a big task to the idling little CPU. A similar mechanism is
used for the 'migrate_load' migration type.
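
A sketch of the relaxed migrate_util check in detach_tasks(), assuming a capped-shift helper like shr_bound() and the shift factor described:

```c
		case migrate_util:
			util = task_util_est(p);

			/*
			 * Allow progressively larger tasks to be pulled as
			 * nr_balance_failed grows, so a stuck little CPU can
			 * eventually take a big task.
			 */
			if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
				goto next;

			env->imbalance -= util;
			break;
```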

Improvement:

Running the testing workload [1] with the step 2 representing
a ~10s load for a big CPU:

  Before patch: ~19.3s
  After patch:  ~18s (-6.7%)

Similar issue reported at:

  https://lore.kernel.org/lkml/20230716014125.139577-1-qyousef@layalina.io/

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Qais Yousef <qyousef@layalina.io>
Link: https://lore.kernel.org/r/20231206090043.634697-1-pierre.gondois@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-23 23:20:13 +01:00
Tejun Heo
799aef6e9d sched/fair: set_load_weight() must also call reweight_task() for SCHED_IDLE tasks
commit d329605287020c3d1c3b0dadc63d8208e7251382 upstream.

When a task's weight is being changed, set_load_weight() is called with
@update_load set. As weight changes aren't trivial for the fair class,
set_load_weight() calls fair.c::reweight_task() for fair class tasks.

However, set_load_weight() first tests task_has_idle_policy() on entry and
skips calling reweight_task() for SCHED_IDLE tasks. This is buggy as
SCHED_IDLE tasks are just fair tasks with a very low weight and they would
incorrectly skip load, vlag and position updates.

Fix it by updating reweight_task() to take struct load_weight as idle weight
can't be expressed with prio and making set_load_weight() call
reweight_task() for SCHED_IDLE tasks too when @update_load is set.
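
A sketch of the fixed set_load_weight() flow, simplified from the description above (field and symbol names assumed):

```c
static void set_load_weight(struct task_struct *p, bool update_load)
{
	int prio = p->static_prio - MAX_RT_PRIO;
	struct load_weight lw;

	if (task_has_idle_policy(p)) {
		lw.weight = scale_load(WEIGHT_IDLEPRIO);
		lw.inv_weight = WMULT_IDLEPRIO;
	} else {
		lw.weight = scale_load(sched_prio_to_weight[prio]);
		lw.inv_weight = sched_prio_to_wmult[prio];
	}

	/* SCHED_IDLE tasks no longer skip the reweight path. */
	if (update_load && p->sched_class == &fair_sched_class)
		reweight_task(p, &lw);
	else
		p->se.load = lw;
}
```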

Fixes: 9059393e4ec1 ("sched/fair: Use reweight_entity() for set_user_nice()")
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # v4.15+
Link: http://lkml.kernel.org/r/20240624102331.GI31592@noisy.programming.kicks-ass.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-23 23:20:12 +01:00
Sultan Alsawaf
900245cda2 schedutil: Allow CPU frequency changes to be amended before they're set
If the last CPU frequency selected isn't set before a new CPU frequency
selection arrives, then use the new selection immediately to avoid using a
stale frequency choice. This improves both performance and energy by more
closely tracking the scheduler's latest decisions.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-11-19 18:06:02 +01:00
friedrich420
5afb8f94f1 Kernel/sched: Reduce Latency [Pafcholini]
Signed-off-by: HolyAngel <slverwolf@gmail.com>
Signed-off-by: Salllz <sal235222727@gmail.com>
Signed-off-by: alanndz <alanndz7@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Little-W <1405481963@qq.com>
2024-11-19 18:05:31 +01:00
Sultan Alsawaf
419052d8e5 sched/fair: Compile out NUMA code entirely when NUMA is disabled
Scheduler code is very hot and every little optimization counts. Instead
of constantly checking sched_numa_balancing when NUMA is disabled,
compile it out.
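
An illustrative sketch of the pattern (placement in task_tick_fair() assumed):

```c
#ifdef CONFIG_NUMA_BALANCING
	/* Only consult the NUMA-balancing static key when it can exist. */
	if (static_branch_unlikely(&sched_numa_balancing))
		task_tick_numa(rq, curr);
#endif
```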

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-11-19 18:05:24 +01:00
Clement Courbet
d4b05cdad5 sched: Optimize __calc_delta()
A significant portion of __calc_delta() time is spent in the loop
shifting a u64 by 32 bits. Use `fls` instead of iterating.

This is ~7x faster on benchmarks.

The generic `fls` implementation (`generic_fls`) is still ~4x faster
than the loop.
Architectures that have a better implementation will make use of it. For
example, on x86 we get an additional factor 2 in speed without dedicated
implementation.

On GCC, the asm versions of `fls` are about the same speed as the
builtin. On Clang, the versions that use fls are more than twice as
slow as the builtin. This is because the way the `fls` function is
written, clang puts the value in memory:
https://godbolt.org/z/EfMbYe. This bug is filed at
https://bugs.llvm.org/show_bug.cgi?id=49406.

```
name                                   cpu/op
BM_Calc<__calc_delta_loop>             9.57ms ±12%
BM_Calc<__calc_delta_generic_fls>      2.36ms ±13%
BM_Calc<__calc_delta_asm_fls>          2.45ms ±13%
BM_Calc<__calc_delta_asm_fls_nomem>    1.66ms ±12%
BM_Calc<__calc_delta_asm_fls64>        2.46ms ±13%
BM_Calc<__calc_delta_asm_fls64_nomem>  1.34ms ±15%
BM_Calc<__calc_delta_builtin>          1.32ms ±11%
```
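
A sketch of the fls()-based normalization that replaces the shift loop in __calc_delta() (simplified):

```c
	u32 fact_hi = (u32)(fact >> 32);

	if (fact_hi) {
		/* fls() finds in one step the shift the old loop found one bit at a time. */
		int fs = fls(fact_hi);

		shift -= fs;
		fact >>= fs;
	}
```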

Signed-off-by: Clement Courbet <courbet@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210303224653.2579656-1-joshdon@google.com
2024-11-19 18:05:19 +01:00
Qais Yousef
971267e87b schedutil : cap iowait boost by uclamp_max
This is a backport of the upstream fix:

d37aee9018e6 ("sched/uclamp: Fix iowait boost escaping uclamp restriction")

Bug: 261695814
Signed-off-by: Qais Yousef <qyousef@google.com>
Change-Id: Ibe8175edb9dea35e325f1a6f4306885ab8b6b28a
2024-11-19 18:05:14 +01:00
Sultan Alsawaf
74cbd01416 sched/core: Use SCHED_RR in place of SCHED_FIFO for all users
Although SCHED_FIFO is a real-time scheduling policy, it can have bad
results on system latency, since each SCHED_FIFO task will run to
completion before yielding to another task. This can result in visible
micro-stalls when a SCHED_FIFO task hogs the CPU for too long. On a
system where latency is favored over throughput, using SCHED_RR is a
better choice than SCHED_FIFO.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Oktapra Amtono <oktapra.amtono@gmail.com>
Signed-off-by: CloudedQuartz <ravenklawasd@gmail.com>
2024-11-19 18:04:58 +01:00
Yaroslav Furman
e7cede92a8 sched: core: silence no longer affine to cpu logspam
Signed-off-by: engstk <eng.stk@sapo.pt>
2024-11-19 18:04:49 +01:00
Sultan Alsawaf
4861626fb1 schedutil: Don't affine sugov kthreads if DVFS is allowed from any CPU
Restricting sugov kthreads to their respective CPUFreq policy's CPUs slows
down schedutil's ability to switch frequencies. When DVFS is allowed from
any CPU, allow respective sugov kthreads to run on any CPU for better
performance.
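
A sketch of the described change around sugov kthread creation, assuming the standard dvfs_possible_from_any_cpu policy flag:

```c
	/* Only pin the kthread when DVFS must run on the policy's own CPUs. */
	if (!policy->dvfs_possible_from_any_cpu)
		kthread_bind_mask(thread, policy->related_cpus);
```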

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-11-19 18:04:45 +01:00
Sultan Alsawaf
0b24a687cf sched: Set sched_nr_migrate back to 32 on RT for Android
Android isn't a real-time userspace and has lots of processes, which makes
the normal sched_nr_migrate value of 32 more appealing. In addition,
there's no observed latency reduction from using a sched_nr_migrate value
of 8, probably because the shallowest idle state on mobile CPUs takes
longer to enter/exit than it takes for the scheduler to do a load balance
run, so our tail end latency is limited by cpuidle anyway.
2024-11-19 18:04:37 +01:00
Rafael J. Wysocki
bc903594c9 cpufreq: schedutil: Reduce frequencies slower
The schedutil governor reduces frequencies too fast in some
situations, which causes undesirable performance drops to
appear.

To address that issue, make schedutil reduce the frequency slower by
setting it to the average of the value chosen during the previous
iteration of governor computations and the new one coming from its
frequency selection formula.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=194963
Reported-by: John <john.ettedgui@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Cykeek <Cykeek@proton.me>
Signed-off-by: negrroo <mohammedaelnaggar1@gmail.com>
Signed-off-by: priiii1808 <priyanshusinghal0818@gmail.com>
2024-11-19 18:04:33 +01:00
Nahuel Gómez
27fe6f89a2 kernel: sched: ems: drop usage of SCHED_FEAT
We removed this.

../kernel/sched/ems/core.c:1370:23: error: use of undeclared identifier 'sched_feat_names'
 1370 |         index = match_string(sched_feat_names, __SCHED_FEAT_NR, "TTWU_QUEUE");
      |                              ^
../kernel/sched/ems/core.c:1370:41: error: use of undeclared identifier '__SCHED_FEAT_NR'
 1370 |         index = match_string(sched_feat_names, __SCHED_FEAT_NR, "TTWU_QUEUE");
      |                                                ^
../kernel/sched/ems/core.c:1372:23: error: use of undeclared identifier 'sched_feat_keys'
 1372 |                 static_key_disable(&sched_feat_keys[index]);
      |                                     ^
../kernel/sched/ems/core.c:1373:3: error: use of undeclared identifier 'sysctl_sched_features'; did you mean 'sysctl_sched_latency'?
 1373 |                 sysctl_sched_features &= ~(1UL << index);
      |                 ^~~~~~~~~~~~~~~~~~~~~
      |                 sysctl_sched_latency
../include/linux/sched/sysctl.h:29:21: note: 'sysctl_sched_latency' declared here
   29 | extern unsigned int sysctl_sched_latency;
      |                     ^
4 errors generated.

Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-11-19 17:52:14 +01:00
Sultan Alsawaf
d4bbaf5715 sched/core: Forbid Unity-based games from changing their CPU affinity
Unity-based games (such as Wild Rift) like to shoot themselves in the foot
by setting a nonsense CPU affinity, restricting the game to a narrow set of
CPU cores that it thinks are the "big" cores in a heterogeneous CPU. It
assumes that CPUs only have two performance domains (clusters), and
therefore royally mucks up games' CPU affinities on CPUs which have more
than two performance domains.

Check if a setaffinity target task is part of a Unity-based game and
silently ignore the setaffinity request so that it can't sabotage itself.
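
A heavily simplified, hypothetical sketch of the check in sched_setaffinity(); the actual detection of Unity tasks is not shown in this log and the "UnityMain" comm is an assumption:

```c
	/* Silently "succeed" so the game cannot sabotage its own affinity. */
	if (!strcmp(p->group_leader->comm, "UnityMain"))
		return 0;
```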

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-11-19 17:43:59 +01:00
Sultan Alsawaf
fa6b06bf46 sched/fair: Always update CPU capacity when load balancing
Limiting CPU capacity updates, which are quite cheap, results in worse
balancing decisions during opportunistic balancing (e.g., SD_BALANCE_WAKE).
This causes opportunistic placement decisions to be skewed using stale CPU
capacity data, and when a CPU isn't idling much, its capacity suffers from
even more staleness since the only exception to the 100 ms capacity update
ratelimit is a CPU exiting idle.

Since the capacity updates are cheap, always do it when load balancing in
order to improve opportunistic task placement decisions.

Change-Id: If1d451ce742fd093010057e31e71012d47fad70a
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-11-19 17:34:49 +01:00
Vitalii Bursov
e03ab0b806 sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level
[ Upstream commit a1fd0b9d751f840df23ef0e75b691fc00cfd4743 ]

Change relax_domain_level checks so that it would be possible
to include or exclude all domains from newidle balancing.

This matches the behavior described in the documentation:

  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).

"2" enables levels 0 and 1, level_max excludes the last (level_max)
level, and level_max+1 includes all levels.
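
A sketch of the adjusted comparison in set_domain_attribute(), based on the description above (the original used a strict '>'):

```c
	if (sd->level >= request) {
		/* Turn off idle balance on this domain: */
		sd->flags &= ~(SD_BALANCE_WAKE | SD_BALANCE_NEWIDLE);
	}
```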

Fixes: 1d3504fcf560 ("sched, cpuset: customize sched domains, core")
Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-19 12:27:00 +01:00
Cyril Hrubis
83c90ef25c sched/rt: Disallow writing invalid values to sched_rt_period_us
commit 079be8fc630943d9fc70a97807feb73d169ee3fc upstream.

The validation of the value written to sched_rt_period_us was broken
because:

  - sysctl_sched_rt_period is declared as unsigned int
  - it is parsed by proc_dointvec()
  - the range is asserted after the value is parsed by proc_dointvec()

Because of this, negative values written to the file were stored in an
unsigned integer and later interpreted as large positive integers, which
passed the check:

  if (sysctl_sched_rt_period <= 0)
	return EINVAL;

This commit fixes the parsing by setting an explicit range for both
period_us and runtime_us in the sched_rt_sysctls table and processing
the values with proc_dointvec_minmax() instead.
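
A sketch of an explicit-range entry in the sched_rt_sysctls table (limits assumed to be 1..INT_MAX as described):

```c
	{
		.procname	= "sched_rt_period_us",
		.data		= &sysctl_sched_rt_period,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= sched_rt_handler,
		.extra1		= SYSCTL_ONE,
		.extra2		= SYSCTL_INT_MAX,
	},
```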

Alternatively, if we wanted to use the full range of unsigned int for the
period value, we would have to split the proc_handler and use
proc_douintvec() for it; however, even
Documentation/scheduler/sched-rt-group.rst describes the range as 1 to
INT_MAX.

As far as I can tell the only problem this causes is that the sysctl
file allows writing negative values which when read back may confuse
userspace.

There is also a LTP test being submitted for these sysctl files at:

  http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz
[ pvorel: rebased for 5.15, 5.10 ]
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-18 22:25:32 +01:00
Cyril Hrubis
ae1abca9a3 sched/rt: Fix sysctl_sched_rr_timeslice intial value
commit c7fcb99877f9f542c918509b2801065adcaf46fa upstream.

There is a 10% rounding error in the initial value of
sysctl_sched_rr_timeslice with CONFIG_HZ_300=y.

This was found with LTP test sched_rr_get_interval01:

sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90
sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90

What this test does is compare the return value from
sched_rr_get_interval() with the sched_rr_timeslice_ms sysctl file and
fail if they do not match.

The problem it found is the initial sysctl file value, which was computed as:

static int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;

which works fine as long as MSEC_PER_SEC is a multiple of HZ; however, it
introduces a 10% rounding error for CONFIG_HZ_300:

(MSEC_PER_SEC / HZ) * (100 * HZ / 1000)

(1000 / 300) * (100 * 300 / 1000)

3 * 30 = 90

This can be easily fixed by reversing the order of the multiplication
and division. After this fix we get:

(MSEC_PER_SEC * (100 * HZ / 1000)) / HZ

(1000 * (100 * 300 / 1000)) / 300

(1000 * 30) / 300 = 100
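
The corrected initializer, multiplying before dividing (a sketch of the described fix):

```c
int sysctl_sched_rr_timeslice = (MSEC_PER_SEC * RR_TIMESLICE) / HZ;
```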

Fixes: 975e155ed873 ("sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds")
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Petr Vorel <pvorel@suse.cz>
Link: https://lore.kernel.org/r/20230802151906.25258-2-chrubis@suse.cz
[ pvorel: rebased for 5.15, 5.10 ]
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-18 22:25:32 +01:00
Cyril Hrubis
8c4edfd430 sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset
commit c1fc6484e1fb7cc2481d169bfef129a1b0676abe upstream.

The sched_rr_timeslice can be reset to its default by writing a value that
is <= 0. However, after reading from this file we always got the last value
written, which is not useful at all.

$ echo -1 > /proc/sys/kernel/sched_rr_timeslice_ms
$ cat /proc/sys/kernel/sched_rr_timeslice_ms
-1

Fix this by setting the variable that holds the sysctl file value to
jiffies_to_msecs(RR_TIMESLICE) whenever a value <= 0 is written.
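
A sketch of the handler change described, simplified from sched_rr_handler():

```c
	if (!ret && write) {
		sched_rr_timeslice = sysctl_sched_rr_timeslice <= 0 ?
			RR_TIMESLICE : msecs_to_jiffies(sysctl_sched_rr_timeslice);

		/* Make a subsequent read report the default, not the -1 we got. */
		if (sysctl_sched_rr_timeslice <= 0)
			sysctl_sched_rr_timeslice = jiffies_to_msecs(RR_TIMESLICE);
	}
```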

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Petr Vorel <pvorel@suse.cz>
Cc: Mahmoud Adam <mngyadam@amazon.com>
Link: https://lore.kernel.org/r/20230802151906.25258-3-chrubis@suse.cz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-18 22:25:31 +01:00
Linus Torvalds
b22fc42973 sched/membarrier: reduce the ability to hammer on sys_membarrier
commit 944d5fe50f3f03daacfea16300e656a1691c4a23 upstream.

On some systems, sys_membarrier can be very expensive, causing overall
slowdowns for everything. So put a lock on the path in order to
serialize the accesses and prevent it from being called at too high a
frequency and saturating the machine.
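
A sketch of the serialization as adapted in this stable branch (explicit mutex_lock()/mutex_unlock() instead of the cleanup.h guard; placement assumed):

```c
static DEFINE_MUTEX(membarrier_ipi_mutex);

	/* In the expedited paths, around the IPI fan-out: */
	mutex_lock(&membarrier_ipi_mutex);
	smp_call_function_many(tmpmask, ipi_func, NULL, true);
	mutex_unlock(&membarrier_ipi_mutex);
```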

Reviewed-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Fixes: 22e4ebb97582 ("membarrier: Provide expedited private command")
Fixes: c5f58bd58f43 ("membarrier: Provide GLOBAL_EXPEDITED command")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ converted to explicit mutex_*() calls - cleanup.h is not in this stable
  branch - gregkh ]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-18 12:13:39 +01:00
Ksawlii
2674d4402d Revert "kernel: ems/ego: Allow CPU frequency changes to be amended before they're set"
This reverts commit 5d1ef2f0ad.
2024-11-18 07:48:15 +01:00
Sultan Alsawaf
669f8aa664 sched/completion: Expose wait_for_common*() to drivers
Allow drivers to wait with a custom task state specified by exposing the
raw wait_for_common*() functions. This allows code to wait for completions
that are invariant with respect to CPU performance *without* contributing
to load avg, without requiring the wait to be interruptible.
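
A sketch of the exposed prototypes (signatures assumed to match the existing static helpers in kernel/sched/completion.c; header location assumed):

```c
/* include/linux/completion.h */
extern long wait_for_common(struct completion *x, long timeout, int state);
extern long wait_for_common_io(struct completion *x, long timeout, int state);
```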

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-11-17 17:45:08 +01:00
Sultan Alsawaf
5d1ef2f0ad kernel: ems/ego: Allow CPU frequency changes to be amended before they're set
If the last CPU frequency selected isn't set before a new CPU frequency
selection arrives, then use the new selection immediately to avoid using a
stale frequency choice. This improves both performance and energy by more
closely tracking the scheduler's latest decisions.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[Flopster101: Adapted to Exynos energy_aware governor]
Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-11-17 17:42:09 +01:00
Rafael J. Wysocki
51d3ee0bf3 kernel: ems/ego: Reduce frequencies slower
The schedutil governor reduces frequencies too fast in some
situations, which causes undesirable performance drops to
appear.

To address that issue, make schedutil reduce the frequency slower by
setting it to the average of the value chosen during the previous
iteration of governor computations and the new one coming from its
frequency selection formula.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=194963
Reported-by: John <john.ettedgui@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Cykeek <Cykeek@proton.me>
Signed-off-by: negrroo <mohammedaelnaggar1@gmail.com>
Signed-off-by: priiii1808 <priyanshusinghal0818@gmail.com>
[Flopster101: Adapted to Exynos energy_aware governor]
Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-11-17 17:42:02 +01:00
Sultan Alsawaf
cdf47a7386 kernel: ems/ego: Set default up/down rate limits to 500/1000 us
This is empirically observed to yield good performance with reduced power
consumption by configuring the down rate limit to be 2x longer than the
up rate limit. This reduces bouncing between CPU frequencies by
stalling down-clocking, which not only improves performance, but also
counter-intuitively improves power consumption.

The short up/down rate limits also provide improved interactivity and
real-time response.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[Flopster101: Adapted to Exynos energy_aware governor]
Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-11-17 17:41:59 +01:00
Pzqqt
3de61e729d kernel: sched: Provide more PELT half-life options
- Regenerate `kernel/sched/sched-pelt.h` using `Documentation/scheduler/sched-pelt`.
- Now we can choose from 32ms (default), 16ms, 12ms, 8ms.
2024-11-17 17:41:17 +01:00
Pzqqt
648fb626ad kernel: sched: Configuring PELT half-life via Kconfig
Note that adjusting PELT half-life via kernel parameters is only allowed when CONFIG_PELT_UTIL_HALFLIFE_DEFAULT is selected.
2024-11-17 17:41:11 +01:00
darkhz
bf2ac59ec9 sched/uclamp: Fix incorrect uclamp.latency_sensitive setting
This patch fixes the latency_sensitive flag for all cpuset cgroups, so
that the value present in the uclamp.latency_sensitive node directly
corresponds to the task_group's latency_sensitive value.

Prior to this patch, this was not the case. The
uclamp_latency_sensitive() function applied values only to the cpu
cgroup subsys instead of the required cpuset cgroup subsys, as a
result of which the latency_sensitive value remained zero for all
taskgroups irrespective of its setting.

Also, fix a situation where latency_sensitive is enabled for the
cpuset's root cgroup, in which case all tasks will have their value
as 1, which in turn will enable prefer_idle for all tasks. This is
undesired and may cause high battery drain.
2024-11-17 17:38:14 +01:00
Qais Yousef
17cc903017 kernel: ems/ego: cap iowait boost by uclamp_max
This is a backport of the upstream fix:

d37aee9018e6 ("sched/uclamp: Fix iowait boost escaping uclamp restriction")

Bug: 261695814
Signed-off-by: Qais Yousef <qyousef@google.com>
Change-Id: Ibe8175edb9dea35e325f1a6f4306885ab8b6b28a
[Flopster101: Adapted to Exynos energy_aware governor]
Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-11-17 17:37:53 +01:00
Gabriel2392
9721b7ac13 treewide: Fix build errors with clang18 2024-06-15 16:28:49 -03:00
Gabriel2392
7ed7ee9edf Import A536BXXU9EXDC 2024-06-15 16:02:09 -03:00