Queuing wake-ups is not measurably beneficial in any way. Disable it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
[Flopster101: Apply to new way of setting features.]
Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
(cherry picked from commit 8ddf75b4fb1d7b54a795c1dc70bf480a5f049603)
(cherry picked from commit dbf96ce6987d4361b4135124b81cb40b269366c5)
(cherry picked from commit 3291d145fade85cef2830b9d28fe1c90e154ba9c)
Given that a CPU's clock is gated at even the shallowest idle state,
waiting until a CPU idles at least once before reducing its frequency is
putting the cart before the horse. For long-running workloads with low
compute needs, requiring an idle call since the last frequency update to
lower the CPU's frequency results in significantly increased energy usage.
Given that there is already a mechanism in place to ratelimit frequency
changes, this heuristic is wholly unnecessary.
Allow single-CPU performance domains to drop their frequency without
requiring an idle call in between to improve energy. Right off the bat,
this reduces CPU power consumption by 7.5% playing a cat gif in Firefox on
a Pixel 8 (270 mW -> 250 mW). And there is no visible loss of performance.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
* This is a modified version of schedutil, introducing two new tunables: "efficient_freq" and "up_delay".
* Only raise cpufreq to the non-efficient one (higher than effcient frequencies) if the governor keeps requiring non-efficient frequencies for more than up_delay time.
* Override the new frequencies with the efficient one if the consecutive request time doesn't reach up_delay.
* The two tunables support multiple args, e.g. you can set "1248000 1401600" for "efficient_freq" and set "50 60" for "up_delay", which means it would wait 50ms before raising the frequency to 1248mhz and wait for 60ms before raising the frequency to 1401mhz.
[Flopster101: move the kconfig entry to the proper section.]
Earlier commits in this series allow battery-powered systems to build
their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option.
This Kconfig option causes call_rcu() to delay its callbacks in order
to batch them. This means that a given RCU grace period covers more
callbacks, thus reducing the number of grace periods, in turn reducing
the amount of energy consumed, which increases battery lifetime which
can be a very good thing. This is not a subtle effect: In some important
use cases, the battery lifetime is increased by more than 10%.
This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.
Delaying callbacks is normally not a problem because most callbacks do
nothing but free memory. If the system is short on memory, a shrinker
will kick all currently queued lazy callbacks out of their laziness,
thus freeing their memory in short order. Similarly, the rcu_barrier()
function, which blocks until all currently queued callbacks are invoked,
will also kick lazy callbacks, thus enabling rcu_barrier() to complete
in a timely manner.
However, there are some cases where laziness is not a good option.
For example, synchronize_rcu() invokes call_rcu(), and blocks until
the newly queued callback is invoked. It would not be a good for
synchronize_rcu() to block for ten seconds, even on an idle system.
Therefore, synchronize_rcu() invokes call_rcu_flush() instead of
call_rcu(). The arrival of a non-lazy call_rcu_flush() callback on a
given CPU kicks any lazy callbacks that might be already queued on that
CPU. After all, if there is going to be a grace period, all callbacks
might as well get full benefit from it.
Yes, this could be done the other way around by creating a
call_rcu_lazy(), but earlier experience with this approach and
feedback at the 2022 Linux Plumbers Conference shifted the approach
to call_rcu() being lazy with call_rcu_flush() for the few places
where laziness is inappropriate.
And another call_rcu() instance that cannot be lazy is the one
in queue_rcu_work(), given that callers to queue_rcu_work() are
not necessarily OK with long delays.
Therefore, make queue_rcu_work() use call_rcu_flush() in order to revert
to the old behavior.
Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This imports a couple of memory management hacks used by Oplus (Oppo, Realme, and OnePlus) on their recent devices. This patch does the following:
- Add a separate swappiness (default 60) value to use if task isn't handled by kswapd
- Don't throttle tasks with OOM value < 0
- Increase zsmalloc's maximum page order
Signed-off-by: Dark-Matter7232 <kerneldeveloper7232@gmail.com>
[TenSeventy7: Make the Kconfig more descriptive]
Signed-off-by: John Vincent <git@tenseventyseven.cf>
[Flopster101: Adapt to 5.10. Hardcode ISOLATED_BITS value and don't set ZS_MAX_ZSPAGE_ORDER]
Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
Commit "rcu: Create RCU-specific workqueues with rescuers" switched RCU
to using local workqueses and removed power efficiency flag from them.
This caused a performance regression that can be observed in Geekbench 5
after enabling CONFIG_WQ_POWER_EFFICIENT_DEFAULT: score went down from
760/2500 to 620/2300 (single/multi core respectively).
Add WQ_POWER_EFFICIENT flag to avoid this regression.
Change-Id: I2c4f41faa55548f9e81a1c0cbe10703948062d89
These are used in Android.
Promote these to disable CONFIG_SCHED_DEBUG.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
[0ctobot: Adapted for 4.19]
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Change-Id: I8053176882e155926769939de15da375e7d548a0
After a dedicated grace-period workqueue was added to RCU in order to
benefit from rescuer threads, the relevant workers were moved to the new
workqueue away from system_power_efficient_wq. The old workqueue was
unbound, which is desirable for performance reasons. Making the workers
bound measurably regressed performance, so make them unbound again.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Panchajanya1999 <kernel@panchajanya.dev>
(cherry picked from commit 139be6da6ee92c8b0aa2d19081fa230435082054)
(cherry picked from commit 3428cd70087fe5fd538d96e89f8f12266aacc2d1)
(cherry picked from commit b4e1beb3e73ccd6f341f95020becfff0b7139745)
(cherry picked from commit 428f05ee3318ed4314317bb697abec26a4bcc930)
commit 1be59c97c83ccd67a519d8a49486b3a8a73ca28a upstream.
An UAF can happen when /proc/cpuset is read as reported in [1].
This can be reproduced by the following methods:
1.add an mdelay(1000) before acquiring the cgroup_lock In the
cgroup_path_ns function.
2.$cat /proc/<pid>/cpuset repeatly.
3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/
$umount /sys/fs/cgroup/cpuset/ repeatly.
The race that cause this bug can be shown as below:
(umount) | (cat /proc/<pid>/cpuset)
css_release | proc_cpuset_show
css_release_work_fn | css = task_get_css(tsk, cpuset_cgrp_id);
css_free_rwork_fn | cgroup_path_ns(css->cgroup, ...);
cgroup_destroy_root | mutex_lock(&cgroup_mutex);
rebind_subsystems |
cgroup_free_root |
| // cgrp was freed, UAF
| cgroup_path_ns_locked(cgrp,..);
When the cpuset is initialized, the root node top_cpuset.css.cgrp
will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation will
allocate cgroup_root, and top_cpuset.css.cgrp will point to the allocated
&cgroup_root.cgrp. When the umount operation is executed,
top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp.
The problem is that when rebinding to cgrp_dfl_root, there are cases
where the cgroup_root allocated by setting up the root for cgroup v1
is cached. This could lead to a Use-After-Free (UAF) if it is
subsequently freed. The descendant cgroups of cgroup v1 can only be
freed after the css is released. However, the css of the root will never
be released, yet the cgroup_root should be freed when it is unmounted.
This means that obtaining a reference to the css of the root does
not guarantee that css.cgrp->root will not be freed.
Fix this problem by using rcu_read_lock in proc_cpuset_show().
As cgroup_root is kfree_rcu after commit d23b5c577715
("cgroup: Make operations on the cgroup root_list RCU safe"),
css->cgroup won't be freed during the critical section.
To call cgroup_path_ns_locked, css_set_lock is needed, so it is safe to
replace task_get_css with task_css.
[1] https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shivani Agarwal <shivani.agarwal@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Although __mutex_init() already contains a hook, but this
function may be called before the mutex_init hook is registered,
causing mutex's oem_data to be uninitialized and causing unpredictable errors.
Bug: 352181884
Change-Id: I04378d6668fb4e7b93c11d930ac46aae484fc835
Signed-off-by: zhujingpeng <zhujingpeng@vivo.com>
(cherry picked from commit 96d66062d0767aeafb690ce014ec91785820d62b)
(cherry picked from commit 4c213b2ea1f58bc640894684f65c92b6cdee522d)
Add initialization for rwsem's oem_data and vendor_data.
The __init_rwsem() already contains a hook, but this function
may be called before the rwsem_init hook is registered,
causing some rwsem's oem_data to be uninitialized and
causing unpredictable errors
Bug: 351133539
Change-Id: I7bbb83894d200102bc7d84e91678f164529097a0
Signed-off-by: zhujingpeng <zhujingpeng@vivo.com>
(cherry picked from commit aaca6b10f1a352dec4596548396f590500f2001b)
(cherry picked from commit 1a5cffcbb543099802723c420f2f538b12619a72)
When a writer has set sem->block and is waiting for active readers
to complete, we still allow some specific new readers to entry
the critical section, which can help prevent priority inversion
from impacting system responsiveness and performance.
Bug: 334851707
Change-Id: I9e2a7df1efb326763487423d64bcf74d8dec23f8
Signed-off-by: zhujingpeng <zhujingpeng@vivo.com>
(cherry picked from commit 458cdb59f7719f81504d392daa95a8c99c30c276)
(cherry picked from commit b74701f8daa1a415c3ea2e819237d1bff59736eb)
these hooks are required by the following features:
1.For rwsem readers, currently only the latest reader will be
recorded in sem->owner. We add hooks to record all readers which
have acquired the lock, once there are UX threads blocked in the
rwsem waiting list, these read_owners will be given high
priority in scheduling.
2.For rwsem writer, when a writer acquires the lock, we check
whether there are UX threads blocked in the rwsem wait list. If
so, we give this writer a high priority in scheduling so that it
can release the lock as soon as possible.
Both of these features can optimize the priority inversion
problem caused by rwsem and improve system responsiveness and
performance.
There is already a hook android_vh_rwsem_set_owner in rwsem_set_owner() to record writer owned, so the hook android_vh_record_rwsem_writer_owned in cherry pick is removed.
Bug: 335408185
Change-Id: I82a6fbb6acd2ce05d049e686b61e34e4d3b39a5e
Signed-off-by: zhujingpeng <zhujingpeng@vivo.com>
[jstultz: Rebased and resolved minor conflict]
Signed-off-by: John Stultz <jstultz@google.com>
(cherry picked from commit 188c41744ddcccef6daf6dcd4d6a444dabdc2f94)
(cherry picked from commit 869fc79d3abf1a1bcd97925fa3e0ee7188bb737c)
[ Upstream commit bd44ca3de49cc1badcff7a96010fa2c64f04868c ]
Currently the dma debugging code can end up indirectly calling printk
under the radix_lock. This happens when a radix tree node allocation
fails.
This is a problem because the printk code, when used together with
netconsole, can end up inside the dma debugging code while trying to
transmit a message over netcons.
This creates the possibility of either a circular deadlock on the same
CPU, with that CPU trying to grab the radix_lock twice, or an ABBA
deadlock between different CPUs, where one CPU grabs the console lock
first and then waits for the radix_lock, while the other CPU is holding
the radix_lock and is waiting for the console lock.
The trace captured by lockdep is of the ABBA variant.
-> #2 (&dma_entry_hash[i].lock){-.-.}-{2:2}:
_raw_spin_lock_irqsave+0x5a/0x90
debug_dma_map_page+0x79/0x180
dma_map_page_attrs+0x1d2/0x2f0
bnxt_start_xmit+0x8c6/0x1540
netpoll_start_xmit+0x13f/0x180
netpoll_send_skb+0x20d/0x320
netpoll_send_udp+0x453/0x4a0
write_ext_msg+0x1b9/0x460
console_flush_all+0x2ff/0x5a0
console_unlock+0x55/0x180
vprintk_emit+0x2e3/0x3c0
devkmsg_emit+0x5a/0x80
devkmsg_write+0xfd/0x180
do_iter_readv_writev+0x164/0x1b0
vfs_writev+0xf9/0x2b0
do_writev+0x6d/0x110
do_syscall_64+0x80/0x150
entry_SYSCALL_64_after_hwframe+0x4b/0x53
-> #0 (console_owner){-.-.}-{0:0}:
__lock_acquire+0x15d1/0x31a0
lock_acquire+0xe8/0x290
console_flush_all+0x2ea/0x5a0
console_unlock+0x55/0x180
vprintk_emit+0x2e3/0x3c0
_printk+0x59/0x80
warn_alloc+0x122/0x1b0
__alloc_pages_slowpath+0x1101/0x1120
__alloc_pages+0x1eb/0x2c0
alloc_slab_page+0x5f/0x150
new_slab+0x2dc/0x4e0
___slab_alloc+0xdcb/0x1390
kmem_cache_alloc+0x23d/0x360
radix_tree_node_alloc+0x3c/0xf0
radix_tree_insert+0xf5/0x230
add_dma_entry+0xe9/0x360
dma_map_page_attrs+0x1d2/0x2f0
__bnxt_alloc_rx_frag+0x147/0x180
bnxt_alloc_rx_data+0x79/0x160
bnxt_rx_skb+0x29/0xc0
bnxt_rx_pkt+0xe22/0x1570
__bnxt_poll_work+0x101/0x390
bnxt_poll+0x7e/0x320
__napi_poll+0x29/0x160
net_rx_action+0x1e0/0x3e0
handle_softirqs+0x190/0x510
run_ksoftirqd+0x4e/0x90
smpboot_thread_fn+0x1a8/0x270
kthread+0x102/0x120
ret_from_fork+0x2f/0x40
ret_from_fork_asm+0x11/0x20
This bug is more likely than it seems, because when one CPU has run out
of memory, chances are the other has too.
The good news is, this bug is hidden behind the CONFIG_DMA_API_DEBUG, so
not many users are likely to trigger it.
Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Konstantin Ovsepian <ovs@meta.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit cc5645fddb0ce28492b15520306d092730dffa48 upstream.
There is a possibility of buffer overflow in
show_rcu_tasks_trace_gp_kthread() if counters, passed
to sprintf() are huge. Counter numbers, needed for this
are unrealistically high, but buffer overflow is still
possible.
Use snprintf() with buffer size instead of sprintf().
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: edf3775f0ad6 ("rcu-tasks: Add count for idle tasks on offline CPUs")
Signed-off-by: Nikita Kiryushin <kiryushin@ancud.ru>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Vamsi Krishna Brahmajosyula <vamsi-krishna.brahmajosyula@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 57b56d16800e8961278ecff0dc755d46c4575092 ]
The writing of css->cgroup associated with the cgroup root in
rebind_subsystems() is currently protected only by cgroup_mutex.
However, the reading of css->cgroup in both proc_cpuset_show() and
proc_cgroup_show() is protected just by css_set_lock. That makes the
readers susceptible to racing problems like data tearing or caching.
It is also a problem that can be reported by KCSAN.
This can be fixed by using READ_ONCE() and WRITE_ONCE() to access
css->cgroup. Alternatively, the writing of css->cgroup can be moved
under css_set_lock as well which is done by this patch.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 77aeb1b685f9db73d276bad4bb30d48505a6fd23 ]
For CONFIG_DEBUG_OBJECTS_WORK=y kernels sscs.work defined by
INIT_WORK_ONSTACK() is initialized by debug_object_init_on_stack() for
the debug check in __init_work() to work correctly.
But this lacks the counterpart to remove the tracked object from debug
objects again, which will cause a debug object warning once the stack is
freed.
Add the missing destroy_work_on_stack() invocation to cure that.
[ tglx: Massaged changelog ]
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20240704065213.13559-1-qiang.zhang1211@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 2ab9d830262c132ab5db2f571003d80850d56b2a upstream.
Ole reported that event->mmap_mutex is strictly insufficient to
serialize the AUX buffer, add a per RB mutex to fully serialize it.
Note that in the lock order comment the perf_event::mmap_mutex order
was already wrong, that is, it nesting under mmap_lock is not new with
this patch.
Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams")
Reported-by: Ole <ole@binarygecko.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d33d26036a0274b472299d7dcdaa5fb34329f91b upstream.
rt_mutex_handle_deadlock() is called with rt_mutex::wait_lock held. In the
good case it returns with the lock held and in the deadlock case it emits a
warning and goes into an endless scheduling loop with the lock held, which
triggers the 'scheduling in atomic' warning.
Unlock rt_mutex::wait_lock in the dead lock case before issuing the warning
and dropping into the schedule for ever loop.
[ tglx: Moved unlock before the WARN(), removed the pointless comment,
massaged changelog, added Fixes tag ]
Fixes: 3d5c9340d194 ("rtmutex: Handle deadlock detection smarter")
Signed-off-by: Roland Xu <mu001999@outlook.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/ME0P300MB063599BEF0743B8FA339C2CECC802@ME0P300MB0635.AUSP300.PROD.OUTLOOK.COM
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c4441ca86afe4814039ee1b32c39d833c1a16bbc ]
The bpf_remove_insns() function returns WARN_ON_ONCE(error), where
error is a result of bpf_adj_branches(), and thus should be always 0
However, if for any reason it is not 0, then it will be converted to
boolean by WARN_ON_ONCE and returned to user space as 1, not an actual
error value. Fix this by returning the original err after the WARN check.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241210114245.836164-1-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c5e3a41187ac01425f5ad1abce927905e4ac44e4 ]
KMSAN complains that new_value at cpumask_parse_user() from
write_irq_affinity() from irq_affinity_proc_write() is uninitialized.
[ 148.133411][ T5509] =====================================================
[ 148.135383][ T5509] BUG: KMSAN: uninit-value in find_next_bit+0x325/0x340
[ 148.137819][ T5509]
[ 148.138448][ T5509] Local variable ----new_value.i@irq_affinity_proc_write created at:
[ 148.140768][ T5509] irq_affinity_proc_write+0xc3/0x3d0
[ 148.142298][ T5509] irq_affinity_proc_write+0xc3/0x3d0
[ 148.143823][ T5509] =====================================================
Since bitmap_parse() from cpumask_parse_user() calls find_next_bit(),
any alloc_cpumask_var() + cpumask_parse_user() sequence has possibility
that find_next_bit() accesses uninitialized cpu mask variable. Fix this
problem by replacing alloc_cpumask_var() with zalloc_cpumask_var().
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20210401055823.3929-1-penguin-kernel@I-love.SAKURA.ne.jp
Stable-dep-of: 98feccbf32cf ("tracing: Prevent bad count for tracing_cpumask_write")
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 8421d4c8762bd022cb491f2f0f7019ef51b4f0a7 upstream.
If a newly-added link type doesn't invoke BPF_LINK_TYPE(), accessing
bpf_link_type_strs[link->type] may result in an out-of-bounds access.
To spot such missed invocations early in the future, checking the
validity of link->type in bpf_link_show_fdinfo() and emitting a warning
when such invocations are missed.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241024013558.1135167-3-houtao@huaweicloud.com
[ shung-hsi.yu: break up existing seq_printf() call since commit 68b04864ca42
("bpf: Create links for BPF struct_ops maps.") is not present ]
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d685d55dfc86b1a4bdcec77c3c1f8a83f181264e ]
Make sure the trace_kprobe's module notifer callback function is called
after jump_label's callback is called. Since the trace_kprobe's callback
eventually checks jump_label address during registering new kprobe on
the loading module, jump_label must be updated before this registration
happens.
Link: https://lore.kernel.org/all/173387585556.995044.3157941002975446119.stgit@devnote2/
Fixes: 614243181050 ("tracing/kprobes: Support module init function probing")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 0ef8047b737d7480a5d4c46d956e97c190f13050 upstream.
Add static_call_update_early() for updating static-call targets in
very early boot.
This will be needed for support of Xen guest type specific hypercall
functions.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e9bd9c498cb0f5843996dbe5cbce7a1836a83c70 upstream.
Range propagation must not affect subreg_def marks, otherwise the
following example is rewritten by verifier incorrectly when
BPF_F_TEST_RND_HI32 flag is set:
0: call bpf_ktime_get_ns call bpf_ktime_get_ns
1: r0 &= 0x7fffffff after verifier r0 &= 0x7fffffff
2: w1 = w0 rewrites w1 = w0
3: if w0 < 10 goto +0 --------------> r11 = 0x2f5674a6 (r)
4: r1 >>= 32 r11 <<= 32 (r)
5: r0 = r1 r1 |= r11 (r)
6: exit; if w0 < 0xa goto pc+0
r1 >>= 32
r0 = r1
exit
(or zero extension of w1 at (2) is missing for architectures that
require zero extension for upper register half).
The following happens w/o this patch:
- r0 is marked as not a subreg at (0);
- w1 is marked as subreg at (2);
- w1 subreg_def is overridden at (3) by copy_register_state();
- w1 is read at (5) but mark_insn_zext() does not mark (2)
for zero extension, because w1 subreg_def is not set;
- because of BPF_F_TEST_RND_HI32 flag verifier inserts random
value for hi32 bits of (2) (marked (r));
- this random value is read at (5).
Fixes: 75748837b7e5 ("bpf: Propagate scalar ranges through register assignments.")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://lore.kernel.org/bpf/7e2aa30a62d740db182c170fdd8f81c596df280d.camel@gmail.com
Link: https://lore.kernel.org/bpf/20240924210844.1758441-1-eddyz87@gmail.com
[ shung-hsi.yu: sync_linked_regs() was called find_equal_scalars() before commit
4bf79f9be434 ("bpf: Track equal scalars history on per-instruction level"), and
modification is done because there is only a single call to
copy_register_state() before commit 98d7ca374ba4 ("bpf: Track delta between
"linked" registers."). ]
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I have a RT task X at a high priority and cyclictest on each CPU with
lower priority than X's. If X is active and each CPU wakes their own
cylictest thread then it ends in a longer rto_push storm.
A random CPU determines via balance_rt() that the CPU on which X is
running needs to push tasks. X has the highest priority, cyclictest is
next in line so there is nothing that can be done since the task with
the higher priority is not touched.
tell_cpu_to_push() increments rto_loop_next and schedules
rto_push_irq_work_func() on X's CPU. The other CPUs also increment the
loop counter and do the same. Once rto_push_irq_work_func() is active it
does nothing because it has _no_ pushable tasks on its runqueue. Then
checks rto_next_cpu() and decides to queue irq_work on the local CPU
because another CPU requested a push by incrementing the counter.
I have traces where ~30 CPUs request this ~3 times each before it
finally ends. This greatly increases X's runtime while X isn't making
much progress.
Teach rto_next_cpu() to only return CPUs which also have tasks on their
runqueue which can be pushed away. This does not reduce the
tell_cpu_to_push() invocations (rto_loop_next counter increments) but
reduces the amount of issued rto_push_irq_work_func() if nothing can be
done. As the result the overloaded CPU is blocked less often.
There are still cases where the "same job" is repeated several times
(for instance the current CPU needs to resched but didn't yet because
the irq-work is repeated a few times and so the old task remains on the
CPU) but the majority of request end in tell_cpu_to_push() before an IPI
is issued.
Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20230801152648._y603AS_@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Commit "rcu: Create RCU-specific workqueues with rescuers" switched RCU
to using local workqueses and removed power efficiency flag from them.
This caused a performance regression that can be observed in Geekbench 5
after enabling CONFIG_WQ_POWER_EFFICIENT_DEFAULT: score went down from
760/2500 to 620/2300 (single/multi core respectively).
Add WQ_POWER_EFFICIENT flag to avoid this regression.
Change-Id: I2c4f41faa55548f9e81a1c0cbe10703948062d89
During load balance, we try at most env->loop_max time to move a task.
But it can happen that the loop_max LRU tasks (ie tail of
the cfs_tasks list) can't be moved to dst_cpu because of affinity.
In this case, loop in the list until we found at least one.
The maximum of detached tasks remained the same as before.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220825122726.20819-2-vincent.guittot@linaro.org
This is a forward port of pa/890483 with modifications from the original
patch due to changes in sched/softirq.c which applies the same logic.
We're finding audio glitches caused by audio-producing RT tasks
that are either interrupted to handle softirq's or that are
scheduled onto cpu's that are handling softirq's.
In a previous patch, we attempted to catch many cases of the
latter problem, but it's clear that we are still losing
significant numbers of races in some apps.
This patch attempts to address the following problem::
It attempts to reduce the most common windows in which
we lose the race between scheduling an RT task on a remote
core and starting to handle softirq's on that core.
We still lose some races, but we lose significantly fewer.
(And we don't want to introduce any heavyweight forms
of synchronization on these paths.)
Bug: 64912585
Bug: 136771796
Bug: 144961676
Change-Id: Ida89a903be0f1965552dd0e84e67ef1d3158c7d8
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Signed-off-by: Jimmy Shiu <jimmyshiu@google.com>
(cherry picked from commit 3c2569f21be6b7c457b389821b098d30bd03b84c)
(cherry picked from commit f9cf84966a315169f3f505993b9e15b6c53f2bbb)
(cherry picked from commit e6edaa29efee9e2f2451ab50030d8cde19ab227b)
(cherry picked from commit db52e23658eaf304ba2333f04d910313c0c21e05)
(cherry picked from commit aa2d4d07929d1c06c6b99ed8718394cf8e471d21)
(cherry picked from commit cefdb96edccdc3f788275d1bcef89c8dfeeb8b11)
(cherry picked from commit 5a45a0208069ac5704713b35a0858fcd5997f97a)
(cherry picked from commit 86ac137eb7399547c6cd20d60063d244091e920f)
When a new threshold breaching stall happens after a psi event was
generated and within the window duration, the new event is not
generated because the events are rate-limited to one per window. If
after that no new stall is recorded then the event will not be
generated even after rate-limiting duration has passed. This is
happening because with no new stall, window_update will not be called
even though threshold was previously breached. To fix this, record
threshold breaching occurrence and generate the event once window
duration is passed.
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/1643093818-19835-1-git-send-email-huangzhaoyang@gmail.com
(cherry picked from commit e6df4ead85d9da1b07dd40bd4c6d2182f3e210c4)
(cherry picked from commit 007db79713251fd67cc5ce7142942772f6074aca)
(cherry picked from commit 129c4324ce9ea8be9c528629bca72e1e1f80d3fb)
When a trigger being created, its win.start_value and win.start_time are
reset to zero. If group->total[PSI_POLL][t->state] has accumulated before,
this trigger will be fired unexpectedly in the next period, even if its
growth time does not reach its threshold.
So set the window of the new trigger to the current state value.
Signed-off-by: Hailong Liu <liuhailong@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/1648789811-3788971-1-git-send-email-liuhailong@linux.alibaba.com
(cherry picked from commit 915a087e4c47334a2f7ba2a4092c4bade0873769)
(cherry picked from commit 168af79cd0d6750ddf6c775c79833059da606c34)
(cherry picked from commit 9b3b0d7ba99e41a1159a0b9c58ff138bb51d52a1)
Martin find it confusing when look at the /proc/pressure/cpu output,
and found no hint about that CPU "full" line in psi Documentation.
% cat /proc/pressure/cpu
some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489
full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277
The PSI_CPU_FULL state is introduced by commit e7fcd7622823
("psi: Add PSI_CPU_FULL state"), which mainly for cgroup level,
but also counted at the system level as a side effect.
Naturally, the FULL state doesn't exist for the CPU resource at
the system level. These "full" numbers can come from CPU idle
schedule latency. For example, t1 is the time when task wakeup
on an idle CPU, t2 is the time when CPU pick and switch to it.
The delta of (t2 - t1) will be in CPU_FULL state.
Another case all processes can be stalled is when all cgroups
have been throttled at the same time, which unlikely to happen.
Anyway, CPU_FULL metric is meaningless and confusing at the
system level. So this patch will report zeroes for CPU full
at the system level, and update psi Documentation accordingly.
Fixes: e7fcd7622823 ("psi: Add PSI_CPU_FULL state")
Reported-by: Martin Steigerwald <Martin.Steigerwald@proact.de>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220408121914.82855-1-zhouchengming@bytedance.com
(cherry picked from commit 890d550d7dbac7a31ecaa78732aa22be282bb6b8)
(cherry picked from commit f5187de2b75d019739caec97e8e7886a27e8554c)
(cherry picked from commit 669718aaede6df28e37f6a11c0e257d18f050b1a)
This patch addresses an issue seen where SCHED_FIFO prio 99
tasks were being woken up on a cpu where a long-running softirq
was executing, and the RT task was not being migrated, causing
long (10+ms wakeup latencies).
Prior to upstream commit 934fc3314b39 ("sched/cpupri: Remap
CPUPRI_NORMAL to MAX_RT_PRIO-1") the task->prio -> cpupri
mapping is a little ackward.
For RT tasks, its calculated as:
cpupri = MAX_RT_PRIO - prio + 1;
See:
https://android.googlesource.com/kernel/common/+/refs/heads/android13-5.10/kernel/sched/cpupri.c#39
This is added ontop of the also ackward detail that the
task->prio is inverted (RT prio99 -> 0), means the cpupri
mapping for RT tasks goes from 2->101. This makes it easy to
evaluate the cpupri incorrectly.
Which it turns out happened In commit 3adfd8e344ac ("ANDROID:
sched: avoid placing RT threads on cores handling softirqs"):
3adfd8e344%5E%21/
With the lines:
int task_pri = convert_prio(p->prio);
bool drop_nopreempts = task_pri <= MAX_RT_PRIO;
Where the added logic to decide to migrate a rt task off of a
cpu depended on this drop_nopreempts being true.
This works properly for rt tasks from prio 99 to 1, but for the
case of task->prio == 0 (userland rt prio 99 tasks) it breaks,
as the cpupri will be MAX_RT_PRIO - 0 + 1, which then gets
checked as <= MAX_RT_PRIO.
This prevents the cpu from being dropped from the scheduling
set and prevents the rt user prio 99 task from migrating, which
results in high priority rt tasks being left on cpus where long
running softirqs are executing, causing long latencies.
This patch fixes the off by one by changing the evaulation
to MAX_RT_PRIO + 1, so that user-prio 99 tasks will also be
migrated if appropriate.
Luckilly this odd cpupri mapping has been fixed upstream, making
this patch no longer necessary in 5.15 and newer kernels.
Fixes: 3adfd8e344ac ("ANDROID: sched: avoid placing RT threads on cores handling softirqs")
Signed-off-by: John Stultz <jstultz@google.com>
Change-Id: Ia2db7cd461eb4c90f5850b791de1ae95582f7530
(cherry picked from commit bec3b2f002cd13d44e32710315de91e636ecef84)
(cherry picked from commit e8e441d4a2f6b6a316d229f8ff1ebe0f40099817)
(cherry picked from commit 1c1f38f1e1fa82e09e0e3f6f941ca81127682ec3)
Use the task_current() function where appropriate.
No functional change.
Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201030173223.GA52339@rlk
While doing newidle load balancing, it is possible for new tasks to
arrive, such as with pending wakeups. newidle_balance() already accounts
for this by exiting the sched_domain load_balance() iteration if it
detects these cases. This is very important for minimizing wakeup
latency.
However, if we are already in load_balance(), we may stay there for a
while before returning back to newidle_balance(). This is most
exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
very straightforward workaround to this is to adjust should_we_balance()
to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
detected.
This was tested with the following reproduction:
- two threads that take turns sleeping and waking each other up are
affined to two cores
- a large number of threads with 100% utilization are pinned to all
other cores
Without this patch, wakeup latency was ~120us for the pair of threads,
almost entirely spent in load_balance(). With this patch, wakeup latency
is ~6us.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220609025515.2086253-1-joshdon@google.com
Similarly to util_avg and util_sum, don't sync load_sum with the low
bound of load_avg but only ensure that load_sum stays in the correct range.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lkml.kernel.org/r/20220111134659.24961-5-vincent.guittot@linaro.org