kernel_samsung_a53x

Author	SHA1	Message	Date
Sultan Alsawaf	6c7dd7bf70	sched: Disable TTWU_QUEUE Queuing wake-ups is not measurably beneficial in any way. Disable it. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> [Flopster101: Apply to new way of setting features.] Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>	2025-01-21 21:30:11 +01:00
freak07	58659caf1a	treewide: use power efficient workingqueues (cherry picked from commit 8ddf75b4fb1d7b54a795c1dc70bf480a5f049603) (cherry picked from commit dbf96ce6987d4361b4135124b81cb40b269366c5) (cherry picked from commit 3291d145fade85cef2830b9d28fe1c90e154ba9c)	2025-01-21 21:28:39 +01:00
ztc1997	7f393c3513	cpufreq: {schedutil, schedhorizon}: Allow single-CPU frequency to drop without idling Given that a CPU's clock is gated at even the shallowest idle state, waiting until a CPU idles at least once before reducing its frequency is putting the cart before the horse. For long-running workloads with low compute needs, requiring an idle call since the last frequency update to lower the CPU's frequency results in significantly increased energy usage. Given that there is already a mechanism in place to ratelimit frequency changes, this heuristic is wholly unnecessary. Allow single-CPU performance domains to drop their frequency without requiring an idle call in between to improve energy. Right off the bat, this reduces CPU power consumption by 7.5% playing a cat gif in Firefox on a Pixel 8 (270 mW -> 250 mW). And there is no visible loss of performance. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>	2025-01-21 21:28:32 +01:00
ztc1997	0e4add4b71	schedhorizon: Sync with schedutil	2025-01-21 21:28:28 +01:00
ztc1997	2a7073f6d6	schedhorizon: Introduce schedhorizon cpufreq governor * This is a modified version of schedutil, introducing two new tunables: "efficient_freq" and "up_delay". * Only raise cpufreq to the non-efficient one (higher than effcient frequencies) if the governor keeps requiring non-efficient frequencies for more than up_delay time. * Override the new frequencies with the efficient one if the consecutive request time doesn't reach up_delay. * The two tunables support multiple args, e.g. you can set "1248000 1401600" for "efficient_freq" and set "50 60" for "up_delay", which means it would wait 50ms before raising the frequency to 1248mhz and wait for 60ms before raising the frequency to 1401mhz. [Flopster101: move the kconfig entry to the proper section.]	2025-01-21 21:28:17 +01:00
Ksawlii	1b696e9b5d	Revert "workqueue: Make queue_rcu_work() use call_rcu_flush()" This reverts commit `d9f15863ae`.	2025-01-19 20:29:35 +01:00
Uladzislau Rezki	d9f15863ae	workqueue: Make queue_rcu_work() use call_rcu_flush() Earlier commits in this series allow battery-powered systems to build their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option. This Kconfig option causes call_rcu() to delay its callbacks in order to batch them. This means that a given RCU grace period covers more callbacks, thus reducing the number of grace periods, in turn reducing the amount of energy consumed, which increases battery lifetime which can be a very good thing. This is not a subtle effect: In some important use cases, the battery lifetime is increased by more than 10%. This CONFIG_RCU_LAZY=y option is available only for CPUs that offload callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y. Delaying callbacks is normally not a problem because most callbacks do nothing but free memory. If the system is short on memory, a shrinker will kick all currently queued lazy callbacks out of their laziness, thus freeing their memory in short order. Similarly, the rcu_barrier() function, which blocks until all currently queued callbacks are invoked, will also kick lazy callbacks, thus enabling rcu_barrier() to complete in a timely manner. However, there are some cases where laziness is not a good option. For example, synchronize_rcu() invokes call_rcu(), and blocks until the newly queued callback is invoked. It would not be a good for synchronize_rcu() to block for ten seconds, even on an idle system. Therefore, synchronize_rcu() invokes call_rcu_flush() instead of call_rcu(). The arrival of a non-lazy call_rcu_flush() callback on a given CPU kicks any lazy callbacks that might be already queued on that CPU. After all, if there is going to be a grace period, all callbacks might as well get full benefit from it. Yes, this could be done the other way around by creating a call_rcu_lazy(), but earlier experience with this approach and feedback at the 2022 Linux Plumbers Conference shifted the approach to call_rcu() being lazy with call_rcu_flush() for the few places where laziness is inappropriate. And another call_rcu() instance that cannot be lazy is the one in queue_rcu_work(), given that callers to queue_rcu_work() are not necessarily OK with long delays. Therefore, make queue_rcu_work() use call_rcu_flush() in order to revert to the old behavior. Signed-off-by: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2025-01-19 20:17:36 +01:00
Dark-Matter7232	094fe4e123	mm: Import memory management hacks from Oplus This imports a couple of memory management hacks used by Oplus (Oppo, Realme, and OnePlus) on their recent devices. This patch does the following: - Add a separate swappiness (default 60) value to use if task isn't handled by kswapd - Don't throttle tasks with OOM value < 0 - Increase zsmalloc's maximum page order Signed-off-by: Dark-Matter7232 <kerneldeveloper7232@gmail.com> [TenSeventy7: Make the Kconfig more descriptive] Signed-off-by: John Vincent <git@tenseventyseven.cf> [Flopster101: Adapt to 5.10. Hardcode ISOLATED_BITS value and don't set ZS_MAX_ZSPAGE_ORDER] Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>	2025-01-19 20:17:06 +01:00
Yaroslav Furman	630769e32b	rcu: Fix a performance regression. Commit "rcu: Create RCU-specific workqueues with rescuers" switched RCU to using local workqueses and removed power efficiency flag from them. This caused a performance regression that can be observed in Geekbench 5 after enabling CONFIG_WQ_POWER_EFFICIENT_DEFAULT: score went down from 760/2500 to 620/2300 (single/multi core respectively). Add WQ_POWER_EFFICIENT flag to avoid this regression. Change-Id: I2c4f41faa55548f9e81a1c0cbe10703948062d89	2025-01-19 20:05:47 +01:00
Park Ju Hyung	8d281d0b97	sysctl: promote several nodes out of CONFIG_SCHED_DEBUG These are used in Android. Promote these to disable CONFIG_SCHED_DEBUG. Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com> [0ctobot: Adapted for 4.19] Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Change-Id: I8053176882e155926769939de15da375e7d548a0	2025-01-19 19:54:03 +01:00
Sultan Alsawaf	bc0e96e7ae	rcu: Make the grace period workers unbound again After a dedicated grace-period workqueue was added to RCU in order to benefit from rescuer threads, the relevant workers were moved to the new workqueue away from system_power_efficient_wq. The old workqueue was unbound, which is desirable for performance reasons. Making the workers bound measurably regressed performance, so make them unbound again. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Panchajanya1999 <kernel@panchajanya.dev> (cherry picked from commit 139be6da6ee92c8b0aa2d19081fa230435082054) (cherry picked from commit 3428cd70087fe5fd538d96e89f8f12266aacc2d1) (cherry picked from commit b4e1beb3e73ccd6f341f95020becfff0b7139745) (cherry picked from commit 428f05ee3318ed4314317bb697abec26a4bcc930)	2025-01-19 17:13:37 +01:00
Chen Ridong	ddf2093c3e	cgroup/cpuset: Prevent UAF in proc_cpuset_show() commit 1be59c97c83ccd67a519d8a49486b3a8a73ca28a upstream. An UAF can happen when /proc/cpuset is read as reported in [1]. This can be reproduced by the following methods: 1.add an mdelay(1000) before acquiring the cgroup_lock In the cgroup_path_ns function. 2.$cat /proc/<pid>/cpuset repeatly. 3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/ $umount /sys/fs/cgroup/cpuset/ repeatly. The race that cause this bug can be shown as below: (umount) \| (cat /proc/<pid>/cpuset) css_release \| proc_cpuset_show css_release_work_fn \| css = task_get_css(tsk, cpuset_cgrp_id); css_free_rwork_fn \| cgroup_path_ns(css->cgroup, ...); cgroup_destroy_root \| mutex_lock(&cgroup_mutex); rebind_subsystems \| cgroup_free_root \| \| // cgrp was freed, UAF \| cgroup_path_ns_locked(cgrp,..); When the cpuset is initialized, the root node top_cpuset.css.cgrp will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation will allocate cgroup_root, and top_cpuset.css.cgrp will point to the allocated &cgroup_root.cgrp. When the umount operation is executed, top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp. The problem is that when rebinding to cgrp_dfl_root, there are cases where the cgroup_root allocated by setting up the root for cgroup v1 is cached. This could lead to a Use-After-Free (UAF) if it is subsequently freed. The descendant cgroups of cgroup v1 can only be freed after the css is released. However, the css of the root will never be released, yet the cgroup_root should be freed when it is unmounted. This means that obtaining a reference to the css of the root does not guarantee that css.cgrp->root will not be freed. Fix this problem by using rcu_read_lock in proc_cpuset_show(). As cgroup_root is kfree_rcu after commit d23b5c577715 ("cgroup: Make operations on the cgroup root_list RCU safe"), css->cgroup won't be freed during the critical section. To call cgroup_path_ns_locked, css_set_lock is needed, so it is safe to replace task_get_css with task_css. [1] https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shivani Agarwal <shivani.agarwal@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-01-19 14:59:05 +01:00
zhujingpeng	dacbe677af	ANDROID: GKI: Add initialization for mutex oem_data. Although __mutex_init() already contains a hook, but this function may be called before the mutex_init hook is registered, causing mutex's oem_data to be uninitialized and causing unpredictable errors. Bug: 352181884 Change-Id: I04378d6668fb4e7b93c11d930ac46aae484fc835 Signed-off-by: zhujingpeng <zhujingpeng@vivo.com> (cherry picked from commit 96d66062d0767aeafb690ce014ec91785820d62b) (cherry picked from commit 4c213b2ea1f58bc640894684f65c92b6cdee522d)	2025-01-19 14:56:36 +01:00
zhujingpeng	b406bff5bd	ANDROID: GKI: Add initialization for rwsem's oem_data and vendor_data. Add initialization for rwsem's oem_data and vendor_data. The __init_rwsem() already contains a hook, but this function may be called before the rwsem_init hook is registered, causing some rwsem's oem_data to be uninitialized and causing unpredictable errors Bug: 351133539 Change-Id: I7bbb83894d200102bc7d84e91678f164529097a0 Signed-off-by: zhujingpeng <zhujingpeng@vivo.com> (cherry picked from commit aaca6b10f1a352dec4596548396f590500f2001b) (cherry picked from commit 1a5cffcbb543099802723c420f2f538b12619a72)	2025-01-19 14:56:25 +01:00
zhujingpeng	5fd39502da	ANDROID: GKI: add percpu_rwsem vendor hooks When a writer has set sem->block and is waiting for active readers to complete, we still allow some specific new readers to entry the critical section, which can help prevent priority inversion from impacting system responsiveness and performance. Bug: 334851707 Change-Id: I9e2a7df1efb326763487423d64bcf74d8dec23f8 Signed-off-by: zhujingpeng <zhujingpeng@vivo.com> (cherry picked from commit 458cdb59f7719f81504d392daa95a8c99c30c276) (cherry picked from commit b74701f8daa1a415c3ea2e819237d1bff59736eb)	2025-01-19 14:55:30 +01:00
zhujingpeng	f0f9800398	ANDROID: vendor_hooks: add hooks in rwsem these hooks are required by the following features: 1.For rwsem readers, currently only the latest reader will be recorded in sem->owner. We add hooks to record all readers which have acquired the lock, once there are UX threads blocked in the rwsem waiting list, these read_owners will be given high priority in scheduling. 2.For rwsem writer, when a writer acquires the lock, we check whether there are UX threads blocked in the rwsem wait list. If so, we give this writer a high priority in scheduling so that it can release the lock as soon as possible. Both of these features can optimize the priority inversion problem caused by rwsem and improve system responsiveness and performance. There is already a hook android_vh_rwsem_set_owner in rwsem_set_owner() to record writer owned, so the hook android_vh_record_rwsem_writer_owned in cherry pick is removed. Bug: 335408185 Change-Id: I82a6fbb6acd2ce05d049e686b61e34e4d3b39a5e Signed-off-by: zhujingpeng <zhujingpeng@vivo.com> [jstultz: Rebased and resolved minor conflict] Signed-off-by: John Stultz <jstultz@google.com> (cherry picked from commit 188c41744ddcccef6daf6dcd4d6a444dabdc2f94) (cherry picked from commit 869fc79d3abf1a1bcd97925fa3e0ee7188bb737c)	2025-01-19 14:55:21 +01:00
Rik van Riel	adf576dfdf	dma-debug: avoid deadlock between dma debug vs printk and netconsole [ Upstream commit bd44ca3de49cc1badcff7a96010fa2c64f04868c ] Currently the dma debugging code can end up indirectly calling printk under the radix_lock. This happens when a radix tree node allocation fails. This is a problem because the printk code, when used together with netconsole, can end up inside the dma debugging code while trying to transmit a message over netcons. This creates the possibility of either a circular deadlock on the same CPU, with that CPU trying to grab the radix_lock twice, or an ABBA deadlock between different CPUs, where one CPU grabs the console lock first and then waits for the radix_lock, while the other CPU is holding the radix_lock and is waiting for the console lock. The trace captured by lockdep is of the ABBA variant. -> #2 (&dma_entry_hash[i].lock){-.-.}-{2:2}: _raw_spin_lock_irqsave+0x5a/0x90 debug_dma_map_page+0x79/0x180 dma_map_page_attrs+0x1d2/0x2f0 bnxt_start_xmit+0x8c6/0x1540 netpoll_start_xmit+0x13f/0x180 netpoll_send_skb+0x20d/0x320 netpoll_send_udp+0x453/0x4a0 write_ext_msg+0x1b9/0x460 console_flush_all+0x2ff/0x5a0 console_unlock+0x55/0x180 vprintk_emit+0x2e3/0x3c0 devkmsg_emit+0x5a/0x80 devkmsg_write+0xfd/0x180 do_iter_readv_writev+0x164/0x1b0 vfs_writev+0xf9/0x2b0 do_writev+0x6d/0x110 do_syscall_64+0x80/0x150 entry_SYSCALL_64_after_hwframe+0x4b/0x53 -> #0 (console_owner){-.-.}-{0:0}: __lock_acquire+0x15d1/0x31a0 lock_acquire+0xe8/0x290 console_flush_all+0x2ea/0x5a0 console_unlock+0x55/0x180 vprintk_emit+0x2e3/0x3c0 _printk+0x59/0x80 warn_alloc+0x122/0x1b0 __alloc_pages_slowpath+0x1101/0x1120 __alloc_pages+0x1eb/0x2c0 alloc_slab_page+0x5f/0x150 new_slab+0x2dc/0x4e0 ___slab_alloc+0xdcb/0x1390 kmem_cache_alloc+0x23d/0x360 radix_tree_node_alloc+0x3c/0xf0 radix_tree_insert+0xf5/0x230 add_dma_entry+0xe9/0x360 dma_map_page_attrs+0x1d2/0x2f0 __bnxt_alloc_rx_frag+0x147/0x180 bnxt_alloc_rx_data+0x79/0x160 bnxt_rx_skb+0x29/0xc0 bnxt_rx_pkt+0xe22/0x1570 __bnxt_poll_work+0x101/0x390 bnxt_poll+0x7e/0x320 __napi_poll+0x29/0x160 net_rx_action+0x1e0/0x3e0 handle_softirqs+0x190/0x510 run_ksoftirqd+0x4e/0x90 smpboot_thread_fn+0x1a8/0x270 kthread+0x102/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x11/0x20 This bug is more likely than it seems, because when one CPU has run out of memory, chances are the other has too. The good news is, this bug is hidden behind the CONFIG_DMA_API_DEBUG, so not many users are likely to trigger it. Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Konstantin Ovsepian <ovs@meta.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-19 14:53:28 +01:00
Nikita Kiryushin	5efd52730f	rcu-tasks: Fix show_rcu_tasks_trace_gp_kthread buffer overflow commit cc5645fddb0ce28492b15520306d092730dffa48 upstream. There is a possibility of buffer overflow in show_rcu_tasks_trace_gp_kthread() if counters, passed to sprintf() are huge. Counter numbers, needed for this are unrealistically high, but buffer overflow is still possible. Use snprintf() with buffer size instead of sprintf(). Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: edf3775f0ad6 ("rcu-tasks: Add count for idle tasks on offline CPUs") Signed-off-by: Nikita Kiryushin <kiryushin@ancud.ru> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Vamsi Krishna Brahmajosyula <vamsi-krishna.brahmajosyula@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-01-19 14:52:20 +01:00
Waiman Long	531312d37d	cgroup: Protect css->cgroup write under css_set_lock [ Upstream commit 57b56d16800e8961278ecff0dc755d46c4575092 ] The writing of css->cgroup associated with the cgroup root in rebind_subsystems() is currently protected only by cgroup_mutex. However, the reading of css->cgroup in both proc_cpuset_show() and proc_cgroup_show() is protected just by css_set_lock. That makes the readers susceptible to racing problems like data tearing or caching. It is also a problem that can be reported by KCSAN. This can be fixed by using READ_ONCE() and WRITE_ONCE() to access css->cgroup. Alternatively, the writing of css->cgroup can be moved under css_set_lock as well which is done by this patch. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-19 00:10:00 +01:00
Zqiang	11e523ae0b	smp: Add missing destroy_work_on_stack() call in smp_call_on_cpu() [ Upstream commit 77aeb1b685f9db73d276bad4bb30d48505a6fd23 ] For CONFIG_DEBUG_OBJECTS_WORK=y kernels sscs.work defined by INIT_WORK_ONSTACK() is initialized by debug_object_init_on_stack() for the debug check in __init_work() to work correctly. But this lacks the counterpart to remove the tracked object from debug objects again, which will cause a debug object warning once the stack is freed. Add the missing destroy_work_on_stack() invocation to cure that. [ tglx: Massaged changelog ] Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240704065213.13559-1-qiang.zhang1211@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-19 00:10:00 +01:00
Sven Schnelle	b6ad0bb927	uprobes: Use kzalloc to allocate xol area commit e240b0fde52f33670d1336697c22d90a4fe33c84 upstream. To prevent unitialized members, use kzalloc to allocate the xol area. Fixes: b059a453b1cf1 ("x86/vdso: Add mremap hook to vm_special_mapping") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240903102313.3402529-1-svens@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-01-19 00:09:58 +01:00
Peter Zijlstra	d453a33157	perf/aux: Fix AUX buffer serialization commit 2ab9d830262c132ab5db2f571003d80850d56b2a upstream. Ole reported that event->mmap_mutex is strictly insufficient to serialize the AUX buffer, add a per RB mutex to fully serialize it. Note that in the lock order comment the perf_event::mmap_mutex order was already wrong, that is, it nesting under mmap_lock is not new with this patch. Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams") Reported-by: Ole <ole@binarygecko.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-01-19 00:09:58 +01:00
Roland Xu	6fd51c9abe	rtmutex: Drop rt_mutex::wait_lock before scheduling commit d33d26036a0274b472299d7dcdaa5fb34329f91b upstream. rt_mutex_handle_deadlock() is called with rt_mutex::wait_lock held. In the good case it returns with the lock held and in the deadlock case it emits a warning and goes into an endless scheduling loop with the lock held, which triggers the 'scheduling in atomic' warning. Unlock rt_mutex::wait_lock in the dead lock case before issuing the warning and dropping into the schedule for ever loop. [ tglx: Moved unlock before the WARN(), removed the pointless comment, massaged changelog, added Fixes tag ] Fixes: 3d5c9340d194 ("rtmutex: Handle deadlock detection smarter") Signed-off-by: Roland Xu <mu001999@outlook.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/ME0P300MB063599BEF0743B8FA339C2CECC802@ME0P300MB0635.AUSP300.PROD.OUTLOOK.COM Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-01-19 00:09:58 +01:00
Ksawlii	1dd5acfeba	Import susfs4ksu	2025-01-18 21:48:58 +01:00
Anton Protopopov	ba63a9f94c	bpf: fix potential error return [ Upstream commit c4441ca86afe4814039ee1b32c39d833c1a16bbc ] The bpf_remove_insns() function returns WARN_ON_ONCE(error), where error is a result of bpf_adj_branches(), and thus should be always 0 However, if for any reason it is not 0, then it will be converted to boolean by WARN_ON_ONCE and returned to user space as 1, not an actual error value. Fix this by returning the original err after the WARN check. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241210114245.836164-1-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-15 16:29:56 +01:00
Lizhi Xu	aa6a6f1a9f	tracing: Prevent bad count for tracing_cpumask_write [ Upstream commit 98feccbf32cfdde8c722bc4587aaa60ee5ac33f0 ] If a large count is provided, it will trigger a warning in bitmap_parse_user. Also check zero for it. Cc: stable@vger.kernel.org Fixes: 9e01c1b74c953 ("cpumask: convert kernel trace functions") Link: https://lore.kernel.org/20241216073238.2573704-1-lizhi.xu@windriver.com Reported-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0aecfd34fb878546f3fd Tested-by: syzbot+0aecfd34fb878546f3fd@syzkaller.appspotmail.com Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-15 16:29:54 +01:00
Tetsuo Handa	5f9a54b11b	kernel: Initialize cpumask before parsing [ Upstream commit c5e3a41187ac01425f5ad1abce927905e4ac44e4 ] KMSAN complains that new_value at cpumask_parse_user() from write_irq_affinity() from irq_affinity_proc_write() is uninitialized. [ 148.133411][ T5509] ===================================================== [ 148.135383][ T5509] BUG: KMSAN: uninit-value in find_next_bit+0x325/0x340 [ 148.137819][ T5509] [ 148.138448][ T5509] Local variable ----new_value.i@irq_affinity_proc_write created at: [ 148.140768][ T5509] irq_affinity_proc_write+0xc3/0x3d0 [ 148.142298][ T5509] irq_affinity_proc_write+0xc3/0x3d0 [ 148.143823][ T5509] ===================================================== Since bitmap_parse() from cpumask_parse_user() calls find_next_bit(), any alloc_cpumask_var() + cpumask_parse_user() sequence has possibility that find_next_bit() accesses uninitialized cpu mask variable. Fix this problem by replacing alloc_cpumask_var() with zalloc_cpumask_var(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20210401055823.3929-1-penguin-kernel@I-love.SAKURA.ne.jp Stable-dep-of: 98feccbf32cf ("tracing: Prevent bad count for tracing_cpumask_write") Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-15 16:29:54 +01:00
Hou Tao	bf28b1db02	bpf: Check validity of link->type in bpf_link_show_fdinfo() commit 8421d4c8762bd022cb491f2f0f7019ef51b4f0a7 upstream. If a newly-added link type doesn't invoke BPF_LINK_TYPE(), accessing bpf_link_type_strs[link->type] may result in an out-of-bounds access. To spot such missed invocations early in the future, checking the validity of link->type in bpf_link_show_fdinfo() and emitting a warning when such invocations are missed. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241024013558.1135167-3-houtao@huaweicloud.com [ shung-hsi.yu: break up existing seq_printf() call since commit 68b04864ca42 ("bpf: Create links for BPF struct_ops maps.") is not present ] Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-15 16:29:51 +01:00
Masami Hiramatsu (Google)	29c072bc19	tracing/kprobe: Make trace_kprobe's module callback called after jump_label update [ Upstream commit d685d55dfc86b1a4bdcec77c3c1f8a83f181264e ] Make sure the trace_kprobe's module notifer callback function is called after jump_label's callback is called. Since the trace_kprobe's callback eventually checks jump_label address during registering new kprobe on the loading module, jump_label must be updated before this registration happens. Link: https://lore.kernel.org/all/173387585556.995044.3157941002975446119.stgit@devnote2/ Fixes: 614243181050 ("tracing/kprobes: Support module init function probing") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-15 16:29:50 +01:00
Ksawlii	9be2607d55	Revert "treewide: patch for susfs 1.5.3" This reverts commit `c7b62da18c`.	2025-01-06 22:38:18 +01:00
Nahuel Gómez	c7b62da18c	treewide: patch for susfs 1.5.3 * Patch: 50_add_susfs_in_gki-android12-5.10.patch * Revision: ec6d3ae385059409a58d67bc2a10d762e740efcb Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>	2025-01-04 19:38:56 +01:00
Juergen Gross	5f757e0171	x86/static-call: provide a way to do very early static-call updates commit 0ef8047b737d7480a5d4c46d956e97c190f13050 upstream. Add static_call_update_early() for updating static-call targets in very early boot. This will be needed for support of Xen guest type specific hypercall functions. This is part of XSA-466 / CVE-2024-53241. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Co-developed-by: Peter Zijlstra <peterz@infradead.org> Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-01-02 17:01:18 +01:00
Eduard Zingerman	c9543019ff	bpf: sync_linked_regs() must preserve subreg_def commit e9bd9c498cb0f5843996dbe5cbce7a1836a83c70 upstream. Range propagation must not affect subreg_def marks, otherwise the following example is rewritten by verifier incorrectly when BPF_F_TEST_RND_HI32 flag is set: 0: call bpf_ktime_get_ns call bpf_ktime_get_ns 1: r0 &= 0x7fffffff after verifier r0 &= 0x7fffffff 2: w1 = w0 rewrites w1 = w0 3: if w0 < 10 goto +0 --------------> r11 = 0x2f5674a6 (r) 4: r1 >>= 32 r11 <<= 32 (r) 5: r0 = r1 r1 \|= r11 (r) 6: exit; if w0 < 0xa goto pc+0 r1 >>= 32 r0 = r1 exit (or zero extension of w1 at (2) is missing for architectures that require zero extension for upper register half). The following happens w/o this patch: - r0 is marked as not a subreg at (0); - w1 is marked as subreg at (2); - w1 subreg_def is overridden at (3) by copy_register_state(); - w1 is read at (5) but mark_insn_zext() does not mark (2) for zero extension, because w1 subreg_def is not set; - because of BPF_F_TEST_RND_HI32 flag verifier inserts random value for hi32 bits of (2) (marked (r)); - this random value is read at (5). Fixes: 75748837b7e5 ("bpf: Propagate scalar ranges through register assignments.") Reported-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Closes: https://lore.kernel.org/bpf/7e2aa30a62d740db182c170fdd8f81c596df280d.camel@gmail.com Link: https://lore.kernel.org/bpf/20240924210844.1758441-1-eddyz87@gmail.com [ shung-hsi.yu: sync_linked_regs() was called find_equal_scalars() before commit 4bf79f9be434 ("bpf: Track equal scalars history on per-instruction level"), and modification is done because there is only a single call to copy_register_state() before commit 98d7ca374ba4 ("bpf: Track delta between "linked" registers."). ] Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-01-02 17:01:17 +01:00
Nahuel Gómez	a6f250c0e8	sched: fair: tune BORE Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>	2024-12-18 15:10:03 +01:00
Nahuel Gómez	46aa3942ce	sched: fair: properly call reweight_task for BORE Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>	2024-12-18 15:09:59 +01:00
Masahito S	f299ae0372	5.4 bootup fix: Variable declaration types	2024-12-18 15:09:53 +01:00
Masahito S	0769637133	5.4 compile error fix: ISO C90 forbids mixed declarations and code	2024-12-18 15:09:49 +01:00
Masahito S	30aca2aa4e	5.4 compile error fix: Use SYSCTL_ZERO and SYSCTL_ONE	2024-12-18 15:09:47 +01:00
Masahito S	ff6bc15f2a	sched: linux5.4.y-bore5.1.0	2024-12-18 15:09:41 +01:00
Sebastian Andrzej Siewior	16ecd487af	sched/rt: Don't try push tasks if there are none. I have a RT task X at a high priority and cyclictest on each CPU with lower priority than X's. If X is active and each CPU wakes their own cylictest thread then it ends in a longer rto_push storm. A random CPU determines via balance_rt() that the CPU on which X is running needs to push tasks. X has the highest priority, cyclictest is next in line so there is nothing that can be done since the task with the higher priority is not touched. tell_cpu_to_push() increments rto_loop_next and schedules rto_push_irq_work_func() on X's CPU. The other CPUs also increment the loop counter and do the same. Once rto_push_irq_work_func() is active it does nothing because it has _no_ pushable tasks on its runqueue. Then checks rto_next_cpu() and decides to queue irq_work on the local CPU because another CPU requested a push by incrementing the counter. I have traces where ~30 CPUs request this ~3 times each before it finally ends. This greatly increases X's runtime while X isn't making much progress. Teach rto_next_cpu() to only return CPUs which also have tasks on their runqueue which can be pushed away. This does not reduce the tell_cpu_to_push() invocations (rto_loop_next counter increments) but reduces the amount of issued rto_push_irq_work_func() if nothing can be done. As the result the overloaded CPU is blocked less often. There are still cases where the "same job" is repeated several times (for instance the current CPU needs to resched but didn't yet because the irq-work is repeated a few times and so the old task remains on the CPU) but the majority of request end in tell_cpu_to_push() before an IPI is issued. Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230801152648._y603AS_@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>	2024-12-18 15:05:40 +01:00
Yaroslav Furman	9c41dd9383	rcu: Fix a performance regression. Commit "rcu: Create RCU-specific workqueues with rescuers" switched RCU to using local workqueses and removed power efficiency flag from them. This caused a performance regression that can be observed in Geekbench 5 after enabling CONFIG_WQ_POWER_EFFICIENT_DEFAULT: score went down from 760/2500 to 620/2300 (single/multi core respectively). Add WQ_POWER_EFFICIENT flag to avoid this regression. Change-Id: I2c4f41faa55548f9e81a1c0cbe10703948062d89	2024-12-18 15:05:33 +01:00
Vincent Guittot	0f92567070	sched/fair: Make sure to try to detach at least one movable task During load balance, we try at most env->loop_max time to move a task. But it can happen that the loop_max LRU tasks (ie tail of the cfs_tasks list) can't be moved to dst_cpu because of affinity. In this case, loop in the list until we found at least one. The maximum of detached tasks remained the same as before. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220825122726.20819-2-vincent.guittot@linaro.org	2024-12-18 15:03:21 +01:00
Miguel de Dios	8b98e08a04	sched: reduce softirq conflicts with RT This is a forward port of pa/890483 with modifications from the original patch due to changes in sched/softirq.c which applies the same logic. We're finding audio glitches caused by audio-producing RT tasks that are either interrupted to handle softirq's or that are scheduled onto cpu's that are handling softirq's. In a previous patch, we attempted to catch many cases of the latter problem, but it's clear that we are still losing significant numbers of races in some apps. This patch attempts to address the following problem:: It attempts to reduce the most common windows in which we lose the race between scheduling an RT task on a remote core and starting to handle softirq's on that core. We still lose some races, but we lose significantly fewer. (And we don't want to introduce any heavyweight forms of synchronization on these paths.) Bug: 64912585 Bug: 136771796 Bug: 144961676 Change-Id: Ida89a903be0f1965552dd0e84e67ef1d3158c7d8 Signed-off-by: Miguel de Dios <migueldedios@google.com> Signed-off-by: Jimmy Shiu <jimmyshiu@google.com> (cherry picked from commit 3c2569f21be6b7c457b389821b098d30bd03b84c) (cherry picked from commit f9cf84966a315169f3f505993b9e15b6c53f2bbb) (cherry picked from commit e6edaa29efee9e2f2451ab50030d8cde19ab227b) (cherry picked from commit db52e23658eaf304ba2333f04d910313c0c21e05) (cherry picked from commit aa2d4d07929d1c06c6b99ed8718394cf8e471d21) (cherry picked from commit cefdb96edccdc3f788275d1bcef89c8dfeeb8b11) (cherry picked from commit 5a45a0208069ac5704713b35a0858fcd5997f97a) (cherry picked from commit 86ac137eb7399547c6cd20d60063d244091e920f)	2024-12-18 15:03:18 +01:00
Zhaoyang Huang	f11d790f3a	psi: fix possible trigger missing in the window When a new threshold breaching stall happens after a psi event was generated and within the window duration, the new event is not generated because the events are rate-limited to one per window. If after that no new stall is recorded then the event will not be generated even after rate-limiting duration has passed. This is happening because with no new stall, window_update will not be called even though threshold was previously breached. To fix this, record threshold breaching occurrence and generate the event once window duration is passed. Suggested-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Suren Baghdasaryan <surenb@google.com> Link: https://lore.kernel.org/r/1643093818-19835-1-git-send-email-huangzhaoyang@gmail.com (cherry picked from commit e6df4ead85d9da1b07dd40bd4c6d2182f3e210c4) (cherry picked from commit 007db79713251fd67cc5ce7142942772f6074aca) (cherry picked from commit 129c4324ce9ea8be9c528629bca72e1e1f80d3fb)	2024-12-18 15:01:14 +01:00
Hailong Liu	e48c85b067	psi: Fix trigger being fired unexpectedly at initial When a trigger being created, its win.start_value and win.start_time are reset to zero. If group->total[PSI_POLL][t->state] has accumulated before, this trigger will be fired unexpectedly in the next period, even if its growth time does not reach its threshold. So set the window of the new trigger to the current state value. Signed-off-by: Hailong Liu <liuhailong@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Suren Baghdasaryan <surenb@google.com> Link: https://lore.kernel.org/r/1648789811-3788971-1-git-send-email-liuhailong@linux.alibaba.com (cherry picked from commit 915a087e4c47334a2f7ba2a4092c4bade0873769) (cherry picked from commit 168af79cd0d6750ddf6c775c79833059da606c34) (cherry picked from commit 9b3b0d7ba99e41a1159a0b9c58ff138bb51d52a1)	2024-12-18 15:01:10 +01:00
Chengming Zhou	09558667e8	sched/psi: report zeroes for CPU full at the system level Martin find it confusing when look at the /proc/pressure/cpu output, and found no hint about that CPU "full" line in psi Documentation. % cat /proc/pressure/cpu some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489 full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277 The PSI_CPU_FULL state is introduced by commit e7fcd7622823 ("psi: Add PSI_CPU_FULL state"), which mainly for cgroup level, but also counted at the system level as a side effect. Naturally, the FULL state doesn't exist for the CPU resource at the system level. These "full" numbers can come from CPU idle schedule latency. For example, t1 is the time when task wakeup on an idle CPU, t2 is the time when CPU pick and switch to it. The delta of (t2 - t1) will be in CPU_FULL state. Another case all processes can be stalled is when all cgroups have been throttled at the same time, which unlikely to happen. Anyway, CPU_FULL metric is meaningless and confusing at the system level. So this patch will report zeroes for CPU full at the system level, and update psi Documentation accordingly. Fixes: e7fcd7622823 ("psi: Add PSI_CPU_FULL state") Reported-by: Martin Steigerwald <Martin.Steigerwald@proact.de> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220408121914.82855-1-zhouchengming@bytedance.com (cherry picked from commit 890d550d7dbac7a31ecaa78732aa22be282bb6b8) (cherry picked from commit f5187de2b75d019739caec97e8e7886a27e8554c) (cherry picked from commit 669718aaede6df28e37f6a11c0e257d18f050b1a)	2024-12-18 15:01:07 +01:00
John Stultz	91839ef7d5	ANDROID: sched: Fix off-by-one with cpupri MAX_RT_PRIO evaluation This patch addresses an issue seen where SCHED_FIFO prio 99 tasks were being woken up on a cpu where a long-running softirq was executing, and the RT task was not being migrated, causing long (10+ms wakeup latencies). Prior to upstream commit 934fc3314b39 ("sched/cpupri: Remap CPUPRI_NORMAL to MAX_RT_PRIO-1") the task->prio -> cpupri mapping is a little ackward. For RT tasks, its calculated as: cpupri = MAX_RT_PRIO - prio + 1; See: https://android.googlesource.com/kernel/common/+/refs/heads/android13-5.10/kernel/sched/cpupri.c#39 This is added ontop of the also ackward detail that the task->prio is inverted (RT prio99 -> 0), means the cpupri mapping for RT tasks goes from 2->101. This makes it easy to evaluate the cpupri incorrectly. Which it turns out happened In commit 3adfd8e344ac ("ANDROID: sched: avoid placing RT threads on cores handling softirqs"): `3adfd8e344`%5E%21/ With the lines: int task_pri = convert_prio(p->prio); bool drop_nopreempts = task_pri <= MAX_RT_PRIO; Where the added logic to decide to migrate a rt task off of a cpu depended on this drop_nopreempts being true. This works properly for rt tasks from prio 99 to 1, but for the case of task->prio == 0 (userland rt prio 99 tasks) it breaks, as the cpupri will be MAX_RT_PRIO - 0 + 1, which then gets checked as <= MAX_RT_PRIO. This prevents the cpu from being dropped from the scheduling set and prevents the rt user prio 99 task from migrating, which results in high priority rt tasks being left on cpus where long running softirqs are executing, causing long latencies. This patch fixes the off by one by changing the evaulation to MAX_RT_PRIO + 1, so that user-prio 99 tasks will also be migrated if appropriate. Luckilly this odd cpupri mapping has been fixed upstream, making this patch no longer necessary in 5.15 and newer kernels. Fixes: 3adfd8e344ac ("ANDROID: sched: avoid placing RT threads on cores handling softirqs") Signed-off-by: John Stultz <jstultz@google.com> Change-Id: Ia2db7cd461eb4c90f5850b791de1ae95582f7530 (cherry picked from commit bec3b2f002cd13d44e32710315de91e636ecef84) (cherry picked from commit e8e441d4a2f6b6a316d229f8ff1ebe0f40099817) (cherry picked from commit 1c1f38f1e1fa82e09e0e3f6f941ca81127682ec3)	2024-12-18 15:01:04 +01:00
Hui Su	f93cf8f6a1	sched: Use task_current() instead of 'rq->curr == p' Use the task_current() function where appropriate. No functional change. Signed-off-by: Hui Su <sh_def@163.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20201030173223.GA52339@rlk	2024-12-18 15:00:48 +01:00
Josh Don	bb74b7f456	sched: Allow newidle balancing to bail out of load_balance While doing newidle load balancing, it is possible for new tasks to arrive, such as with pending wakeups. newidle_balance() already accounts for this by exiting the sched_domain load_balance() iteration if it detects these cases. This is very important for minimizing wakeup latency. However, if we are already in load_balance(), we may stay there for a while before returning back to newidle_balance(). This is most exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A very straightforward workaround to this is to adjust should_we_balance() to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are detected. This was tested with the following reproduction: - two threads that take turns sleeping and waking each other up are affined to two cores - a large number of threads with 100% utilization are pinned to all other cores Without this patch, wakeup latency was ~120us for the pair of threads, almost entirely spent in load_balance(). With this patch, wakeup latency is ~6us. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220609025515.2086253-1-joshdon@google.com	2024-12-18 12:21:46 +01:00
Vincent Guittot	180d869ae6	sched/pelt: Relax the sync of load_sum with load_avg Similarly to util_avg and util_sum, don't sync load_sum with the low bound of load_avg but only ensure that load_sum stays in the correct range. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Link: https://lkml.kernel.org/r/20220111134659.24961-5-vincent.guittot@linaro.org	2024-12-18 12:21:43 +01:00

1 2 3 4 5 ...

446 commits