update_idle_core() is only done for the case of sched_smt_present,
but test_idle_cores() is done for all machines, even those without
SMT.
This can contribute up to an 8%+ hackbench performance loss on a
machine like the Kunpeng 920, which has no SMT. This patch removes
the redundant test_idle_cores() for !SMT machines.
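As a minimal sketch of the idea (the helper below is illustrative, not the
verbatim diff, and test_idle_cores()'s signature varies across kernel
versions):
```
/*
 * Only consult the idle-cores cache when SMT is active:
 * update_idle_core() never populates it on !SMT machines,
 * so the probe is pure overhead there.
 */
static bool probe_idle_cores(int target)
{
	bool has_idle_core = false;

	if (sched_smt_active())	/* static key; false on Kunpeng 920 */
		has_idle_core = test_idle_cores(target, false);

	return has_idle_core;
}
```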
Hackbench is run with -g {2..14}; for each g it is run 10 times to get
an average.
$ numactl -N 0 hackbench -p -T -l 20000 -g $1
Below is the result of hackbench w/ and w/o this patch:
g    =      2      4      6      8     10      12      14
w/o:   1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
w/ :   1.8428 3.7436 5.4501 6.9522 8.2882  9.9535 11.3367
                             +4.1%  +8.3%   +7.3%   +6.3%
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Change-Id: I0dd9363d2b8da9dda0bed205a5ddc36f75fabeef
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
(cherry picked from commit 7c201829c9c1e1ebb1384de66e02b8249d83167e)
Signed-off-by: TogoFire <togofire@mailfence.com>
Signed-off-by: onettboots <blackcocopet@gmail.com>
A bunch of desktop Linux kernels have been reducing this value (the CFS
bandwidth slice) to improve interactivity, from Zen [1] to CachyOS [2]. There have been attempts to reduce it on Android as well.
Experiment with reducing the CFS bandwidth slice to 4 msec, 1 msec less than the default of 5 msec. This is something I honestly don't want userspace to touch, so keep it out of sysfs and modify it from the kernel directly instead. I think that the 'interactivity' benefits of this change (if it does hold water) should be reflected in all performance modes on FreshROMs.
Test for performance and battery life.
[1]: https://github.com/zen-kernel/zen-kernel/commit/7de2596b35ac1db
[2]: https://github.com/CachyOS/linux/blob/base-5.18/kernel/sched/fair.c
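For reference, a sketch of the kernel-side change; the tunable is
sysctl_sched_cfs_bandwidth_slice in kernel/sched/fair.c, which mainline
initializes to 5000UL (5 msec):
```
/*
 * Sketch, not a verbatim diff: each time a cfs_rq pulls runtime from
 * the global quota pool it receives this slice, in usec. A smaller
 * slice means finer-grained (more frequent) runtime distribution.
 */
unsigned int sysctl_sched_cfs_bandwidth_slice = 4000UL;	/* 4 msec, was 5000UL */
```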
Signed-off-by: John Vincent <git@tenseventyseven.cf>
commit 3af7524b14198f5159a86692d57a9f28ec9375ce upstream.
Running N CPU-bound tasks on an N-CPU platform:
- with asymmetric CPU capacity
- not being a DynamIq system (i.e. having a PKG level sched domain
without the SD_SHARE_PKG_RESOURCES flag set)
... might result in a task placement where two tasks run on a big CPU
and none on a little CPU. A more optimal placement would use all
CPUs.
Testing platform:
Juno-r2:
- 2 big CPUs (1-2), maximum capacity of 1024
- 4 little CPUs (0,3-5), maximum capacity of 383
Testing workload ([1]):
Spawn 6 CPU-bound tasks. During the first 100ms (step 1), each task
is affine to a CPU, except for:
- one little CPU, which is left idle.
- one big CPU, which has 2 tasks affine to it.
After the 100ms (step 2), remove the cpumask affinity.
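A rough user-space reconstruction of that workload (assumptions: the
Juno-r2 CPU numbering above; ITERS is a placeholder to be calibrated so
that step 2 amounts to ~10s of work on a big CPU):
```
/* build: gcc -O2 workload.c -o workload (Linux-only) */
#define _GNU_SOURCE
#include <sched.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#define NR_TASKS 6
#define ITERS    (4UL * 1000 * 1000 * 1000)	/* placeholder, calibrate */

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void set_affinity(int cpu)	/* cpu < 0: allow CPUs 0-5 */
{
	cpu_set_t set;
	int c;

	CPU_ZERO(&set);
	if (cpu >= 0)
		CPU_SET(cpu, &set);
	else
		for (c = 0; c < NR_TASKS; c++)
			CPU_SET(c, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
	/* step 1: little CPU 0 left idle, big CPU 1 doubled up */
	static const int step1_cpu[NR_TASKS] = { 1, 1, 2, 3, 4, 5 };
	volatile unsigned long n;
	int i;

	for (i = 0; i < NR_TASKS; i++) {
		if (fork() == 0) {
			double t0 = now();

			set_affinity(step1_cpu[i]);
			while (now() - t0 < 0.1)	/* step 1: 100ms, pinned */
				;
			set_affinity(-1);		/* step 2: affinity removed */
			for (n = 0; n < ITERS; n++)	/* fixed CPU-bound work */
				;
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}
```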
Behavior before the patch:
During step 2, the load balancer running from the idle CPU tags sched
domains as:
- little CPUs: 'group_has_spare'. Cf. group_has_capacity() and
group_is_overloaded(), 3 CPU-bound tasks run on a 4 CPUs
sched-domain, and the idle CPU provides enough spare capacity
regarding the imbalance_pct
- big CPUs: 'group_overloaded'. Indeed, 3 tasks run on a 2 CPUs
sched-domain, so the following path is used:
group_is_overloaded()
\-if (sgs->sum_nr_running <= sgs->group_weight) return true;
The following path which would change the migration type to
'migrate_task' is not taken:
calculate_imbalance()
\-if (env->idle != CPU_NOT_IDLE && env->imbalance == 0)
since the local group has some spare capacity, the imbalance
is not 0.
The migration type requested is 'migrate_util' and the busiest
runqueue is the big CPU's runqueue, which has 2 tasks (each with a
utilization of 512). The idle little CPU cannot pull one of these
tasks as its capacity is too small for the task. The following path
is used:
detach_tasks()
\-case migrate_util:
\-if (util > env->imbalance) goto next;
After the patch:
As the number of failed balancing attempts grows (with
'nr_balance_failed'), progressively make it easier to migrate
a big task to the idling little CPU. A similar mechanism is
used for the 'migrate_load' migration type.
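Concretely, the migrate_util case in detach_tasks() goes from comparing
the raw utilization against the imbalance to scaling it down first by
nr_balance_failed, mirroring the existing migrate_load handling
(shr_bound() is a right shift with the shift amount bounded to avoid
undefined behavior); roughly:
```
case migrate_util:
	util = task_util_est(p);

	/* before: if (util > env->imbalance) goto next; */
	if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
		goto next;

	env->imbalance -= util;
	break;
```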
Improvement:
Running the testing workload [1] with the step 2 representing
a ~10s load for a big CPU:
Before patch: ~19.3s
After patch: ~18s (-6.7%)
Similar issue reported at:
https://lore.kernel.org/lkml/20230716014125.139577-1-qyousef@layalina.io/
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Qais Yousef <qyousef@layalina.io>
Link: https://lore.kernel.org/r/20231206090043.634697-1-pierre.gondois@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d329605287020c3d1c3b0dadc63d8208e7251382 upstream.
When a task's weight is being changed, set_load_weight() is called with
@update_load set. As weight changes aren't trivial for the fair class,
set_load_weight() calls fair.c::reweight_task() for fair class tasks.
However, set_load_weight() first tests task_has_idle_policy() on entry and
skips calling reweight_task() for SCHED_IDLE tasks. This is buggy as
SCHED_IDLE tasks are just fair tasks with a very low weight and they would
incorrectly skip load, vlag and position updates.
Fix it by updating reweight_task() to take struct load_weight (as the idle
weight can't be expressed with a prio), and by making set_load_weight()
call reweight_task() for SCHED_IDLE tasks too when @update_load is set.
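The resulting set_load_weight() looks roughly like this (reconstructed
from the description; reweight_task() now takes a const struct
load_weight *):
```
static void set_load_weight(struct task_struct *p, bool update_load)
{
	int prio = p->static_prio - MAX_RT_PRIO;
	struct load_weight lw;

	if (task_has_idle_policy(p)) {
		/* SCHED_IDLE: fixed minimal weight, not derived from prio */
		lw.weight = scale_load(WEIGHT_IDLEPRIO);
		lw.inv_weight = WMULT_IDLEPRIO;
	} else {
		lw.weight = scale_load(sched_prio_to_weight[prio]);
		lw.inv_weight = sched_prio_to_wmult[prio];
	}

	/*
	 * SCHED_IDLE tasks are fair tasks too, so they must go through
	 * reweight_task() as well to update load, vlag and position.
	 */
	if (update_load && p->sched_class == &fair_sched_class)
		reweight_task(p, &lw);
	else
		p->se.load = lw;
}
```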
Fixes: 9059393e4ec1 ("sched/fair: Use reweight_entity() for set_user_nice()")
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # v4.15+
Link: http://lkml.kernel.org/r/20240624102331.GI31592@noisy.programming.kicks-ass.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Scheduler code is very hot and every little optimization counts. Instead
of constantly checking sched_numa_balancing when NUMA is disabled,
compile it out.
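A minimal sketch of the approach (the helper macro is hypothetical;
mainline declares sched_numa_balancing as a static key under
CONFIG_NUMA_BALANCING):
```
#ifdef CONFIG_NUMA_BALANCING
extern struct static_key_false sched_numa_balancing;
#define numa_balancing_enabled()  static_branch_unlikely(&sched_numa_balancing)
#else
/* hypothetical helper: constant-folds to false, so callers' branches
 * disappear at compile time on !NUMA builds */
#define numa_balancing_enabled()  false
#endif
```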
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
A significant portion of __calc_delta() time is spent in the loop
shifting a u64 by 32 bits. Use `fls` instead of iterating.
This is ~7x faster on benchmarks.
The generic `fls` implementation (`generic_fls`) is still ~4x faster
than the loop.
Architectures that have a better implementation will make use of it. For
example, on x86 we get an additional factor 2 in speed without dedicated
implementation.
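Roughly, the normalization step in __calc_delta() changes from a
bit-at-a-time loop into a single fls()-computed shift:
```
u64 fact = scale_load_down(weight);	/* as in __calc_delta() */
int shift = WMULT_SHIFT;
u32 fact_hi = (u32)(fact >> 32);
int fs;

/*
 * before: up to 32 iterations
 *	while (fact >> 32) {
 *		fact >>= 1;
 *		shift--;
 *	}
 */

/* after: count the high bits once with fls(), shift in one step */
if (fact_hi) {
	fs = fls(fact_hi);	/* 1-based index of the highest set bit */
	shift -= fs;
	fact >>= fs;
}
```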
On GCC, the asm versions of `fls` are about the same speed as the
builtin. On Clang, the versions that use fls are more than twice as
slow as the builtin. This is because of the way the `fls` function is
written; clang puts the value in memory:
https://godbolt.org/z/EfMbYe. This bug is filed at
https://bugs.llvm.org/show_bug.cgi?id=49406.
```
name                                   cpu/op
BM_Calc<__calc_delta_loop>             9.57ms ±12%
BM_Calc<__calc_delta_generic_fls>      2.36ms ±13%
BM_Calc<__calc_delta_asm_fls>          2.45ms ±13%
BM_Calc<__calc_delta_asm_fls_nomem>    1.66ms ±12%
BM_Calc<__calc_delta_asm_fls64>        2.46ms ±13%
BM_Calc<__calc_delta_asm_fls64_nomem>  1.34ms ±15%
BM_Calc<__calc_delta_builtin>          1.32ms ±11%
```
Signed-off-by: Clement Courbet <courbet@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210303224653.2579656-1-joshdon@google.com
Limiting CPU capacity updates, which are quite cheap, results in worse
balancing decisions during opportunistic balancing (e.g., SD_BALANCE_WAKE).
This causes opportunistic placement decisions to be skewed by stale CPU
capacity data, and when a CPU isn't idling much, its capacity suffers from
even more staleness, since the only exception to the 100 ms capacity update
ratelimit is a CPU exiting idle.
Since the capacity updates are cheap, always do them when load balancing in
order to improve opportunistic task placement decisions.
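As a sketch of the intent (the exact condition varies by kernel version;
mainline gates the refresh of the local group in update_sd_lb_stats()
behind sgc->next_update):
```
/*
 * before: ratelimited to roughly every 100 ms
 *	if (local_group && time_after_eq(jiffies, sg->sgc->next_update))
 *		update_group_capacity(env->sd, env->dst_cpu);
 */

/* after (sketch): refresh on every balance pass; the update is cheap */
if (local_group)
	update_group_capacity(env->sd, env->dst_cpu);
```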
Change-Id: If1d451ce742fd093010057e31e71012d47fad70a
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>