Commit graph

7437 commits

Author SHA1 Message Date
Masahito S
0769637133 5.4 compile error fix: ISO C90 forbids mixed declarations and code 2024-12-18 15:09:49 +01:00
Masahito S
30aca2aa4e 5.4 compile error fix: Use SYSCTL_ZERO and SYSCTL_ONE 2024-12-18 15:09:47 +01:00
Masahito S
ff6bc15f2a sched: linux5.4.y-bore5.1.0 2024-12-18 15:09:41 +01:00
Mubashir Adnan Qureshi
d496daa833 tcp: add sysctls for TCP PLB parameters
PLB (Protective Load Balancing) is a host based mechanism for load
balancing across switch links. It leverages congestion signals (e.g. ECN)
from transport layer to randomly change the path of the connection
experiencing congestion. PLB changes the path of the connection by
changing the outgoing IPv6 flow label for IPv6 connections (implemented
in Linux by calling sk_rethink_txhash()). Because of this implementation
mechanism, PLB can currently only work for IPv6 traffic. For more
information, see the SIGCOMM 2022 paper:
  https://doi.org/10.1145/3544216.3544226

This commit adds new sysctl knobs and sets their default values for
TCP PLB.

Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:08:12 +01:00
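
A minimal userspace sketch of inspecting one of these knobs; the
tcp_plb_enabled name and its net.ipv4 location are assumptions based on
the naming convention this series uses, not something stated above:

  #include <stdio.h>

  /* Read a PLB sysctl through procfs; the path is an assumed knob name. */
  int main(void)
  {
      FILE *f = fopen("/proc/sys/net/ipv4/tcp_plb_enabled", "r");
      int enabled;

      if (!f) {
          perror("tcp_plb_enabled (kernel without PLB?)");
          return 1;
      }
      if (fscanf(f, "%d", &enabled) == 1)
          printf("tcp_plb_enabled = %d\n", enabled);
      fclose(f);
      return 0;
  }
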
Mubashir Adnan Qureshi
f3217ac2c6 tcp: add support for PLB in DCTCP
PLB support is added to TCP DCTCP code. As DCTCP uses ECN as the
congestion signal, PLB also uses ECN to make decisions whether to change
the path or not upon sustained congestion.

Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:07:57 +01:00
Mubashir Adnan Qureshi
a00aad3a04 tcp: add rcv_wnd and plb_rehash to TCP_INFO
rcv_wnd can be useful to diagnose TCP performance where receiver window
becomes the bottleneck. rehash reports the PLB and timeout triggered
rehash attempts by the TCP connection.

Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:07:43 +01:00
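
A hedged userspace sketch of reading the two new fields via
getsockopt(TCP_INFO); it assumes kernel headers that already carry
tcpi_rcv_wnd and tcpi_rehash (older headers will not compile this):

  #include <stdio.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <linux/tcp.h>   /* struct tcp_info with tcpi_rcv_wnd/tcpi_rehash */

  int main(void)
  {
      struct tcp_info ti;
      socklen_t len = sizeof(ti);
      int fd = socket(AF_INET, SOCK_STREAM, 0);

      if (fd < 0)
          return 1;
      /* Meaningful numbers need a connected socket; an idle one just
       * returns zeroes, which is enough to show the plumbing. */
      if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
          printf("rcv_wnd=%u rehash=%u\n", ti.tcpi_rcv_wnd, ti.tcpi_rehash);
      return 0;
  }
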
Eric Dumazet
c8a587ff65 tcp: add accessors to read/set tp->snd_cwnd
We had various bugs over the years with code
breaking the assumption that tp->snd_cwnd is greater
than zero.

Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added
in commit 8b8a321ff72c ("tcp: fix zero cwnd in tcp_cwnd_reduction")
can trigger, and without a repro we would have to spend
considerable time finding the bug.

Instead of complaining too late, we want to catch where
and when tp->snd_cwnd is set to an illegal value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:07:40 +01:00
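
The accessor idea, as a standalone userspace model rather than the
actual kernel helpers: every write goes through a setter that complains
the moment an illegal (zero) cwnd is about to be stored, so the
offending call site is caught immediately instead of much later:

  #include <assert.h>
  #include <stdio.h>

  struct tcp_sock_model { unsigned int snd_cwnd; };  /* mock, not the kernel struct */

  static unsigned int tcp_snd_cwnd(const struct tcp_sock_model *tp)
  {
      return tp->snd_cwnd;
  }

  static void tcp_snd_cwnd_set(struct tcp_sock_model *tp, unsigned int val)
  {
      assert(val > 0);        /* stand-in for the kernel's WARN_ON_ONCE() */
      tp->snd_cwnd = val;
  }

  int main(void)
  {
      struct tcp_sock_model tp = { .snd_cwnd = 10 };

      tcp_snd_cwnd_set(&tp, 20);
      printf("cwnd=%u\n", tcp_snd_cwnd(&tp));
      /* tcp_snd_cwnd_set(&tp, 0) would trip the assertion right here. */
      return 0;
  }
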
Yuchung Cheng
33eb09a194 tcp: tracking packets with CE marks in BW rate sample
In order to track CE marks per rate sample (one round trip), TCP needs a
per-skb header field to record the tp->delivered_ce count when the skb
was sent. To make space, we replace the "last_in_flight" field which is
used exclusively for NV congestion control. The stat needed by NV can be
alternatively approximated by existing stats tcp_sock delivered and
mss_cache.

This patch counts the number of packets delivered which have CE marks in
the rate sample, using a similar approach to delivery accounting.

Cc: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Luke Hsiao <lukehsiao@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:07:35 +01:00
Neal Cardwell
0f3a19122b net-tcp_bbr: broaden app-limited rate sample detection
This commit is a bug fix for the Linux TCP app-limited
(application-limited) logic that is used for collecting rate
(bandwidth) samples.

Previously the app-limited logic only looked for "bubbles" of
silence in between application writes, by checking at the start
of each sendmsg. But "bubbles" of silence can also happen before
retransmits: e.g. bubbles can happen between an application write
and a retransmit, or between two retransmits.

Retransmits are triggered by ACKs or timers. So this commit checks
for bubbles of app-limited silence upon ACKs or timers.

Why does this commit check for app-limited state at the start of
ACKs and timer handling? Because at that point we know whether
inflight was fully using the cwnd.  During processing the ACK or
timer event we often change the cwnd; after changing the cwnd we
can't know whether inflight was fully using the old cwnd.

Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc
Change-Id: I37221506f5166877c2b110753d39bb0757985e68
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:07:30 +01:00
Neal Cardwell
0aafceb293 net-tcp_bbr: v2: export FLAG_ECE in rate_sample.is_ece
This is for understanding the relationship between inflight and ECN
signals, to try to find the highest inflight value that has acceptable
levels of ECN marking.

Effort: net-tcp_bbr
Origin-9xx-SHA1: 3eba998f2898541406c2666781182200934965a8
Change-Id: I3a964e04cee83e11649a54507043d2dfe769a3b3
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:07:14 +01:00
Neal Cardwell
1970659e8a net-tcp_bbr: v2: introduce ca_ops->skb_marked_lost() CC module callback API
For connections experiencing reordering, RACK can mark packets lost
long after we receive the SACKs/ACKs hinting that the packets were
actually lost.

This means that CC modules cannot easily learn the volume of inflight
data at which packet loss happens by looking at the current inflight
or even the packets in flight when the most recently SACKed packet was
sent. To learn this, CC modules need to know how many packets were in
flight at the time lost packets were sent. This new callback, combined
with TCP_SKB_CB(skb)->tx.in_flight, allows them to learn this.

This also provides a consistent callback that is invoked whether
packets are marked lost upon ACK processing, using the RACK reordering
timer, or at RTO time.

Effort: net-tcp_bbr
Origin-9xx-SHA1: afcbebe3374e4632ac6714d39e4dc8a8455956f4
Change-Id: I54826ab53df636be537e5d3c618a46145d12d51a
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:07:09 +01:00
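
A toy userspace model of the callback shape described above; the names
mirror the commit but none of this is the kernel definition:

  #include <stdio.h>

  struct skb_model { unsigned int tx_in_flight; };  /* mock of TCP_SKB_CB(skb)->tx.in_flight */

  struct cc_ops_model {
      /* invoked from ACK processing, the RACK timer, or RTO handling */
      void (*skb_marked_lost)(const struct skb_model *skb);
  };

  static void bbr_like_skb_marked_lost(const struct skb_model *skb)
  {
      printf("packet sent at inflight=%u was lost\n", skb->tx_in_flight);
  }

  int main(void)
  {
      struct cc_ops_model ops = { .skb_marked_lost = bbr_like_skb_marked_lost };
      struct skb_model lost = { .tx_in_flight = 37 };

      if (ops.skb_marked_lost)
          ops.skb_marked_lost(&lost);
      return 0;
  }
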
Neal Cardwell
20cf76b03c net-tcp_bbr: v2: adjust skb tx.in_flight upon merge in tcp_shifted_skb()
When tcp_shifted_skb() updates state as adjacent SACKed skbs are
coalesced, previously the tx.in_flight was not adjusted, so we could
get contradictory state where the skb's recorded pcount was bigger
than the tx.in_flight (the number of segments that were in_flight
after sending the skb).

Normally, having a SACKed skb with contradictory pcount/tx.in_flight
would not matter. However, with SACK reneging, the SACKed bit is
removed, and an skb once again becomes eligible for retransmitting,
fragmenting, SACKing, etc. Packetdrill testing verified the following
sequence is possible in a kernel that does not have this commit:

 - skb N is SACKed
 - skb N+1 is SACKed and combined with skb N using tcp_shifted_skb()
   - tcp_shifted_skb() will increase the pcount of prev,
     but leave tx.in_flight as-is
   - so prev skb can have pcount > tx.in_flight
 - RTO, tcp_timeout_mark_lost(), detect reneg,
   remove "SACKed" bit, mark skb N as lost
   - find pcount of skb N is greater than its tx.in_flight

I suspect this issue is what caused the bbr2_inflight_hi_from_lost_skb():
  WARN_ON_ONCE(inflight_prev < 0)
to fire in production machines using bbr2.

Effort: net-tcp_bbr
Origin-9xx-SHA1: 1a3e997e613d2dcf32b947992882854ebe873715
Change-Id: I1b0b75c27519953430c7db51c6f358f104c7af55
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:07:04 +01:00
Neal Cardwell
eddb362fbb net-tcp_bbr: v2: adjust skb tx.in_flight upon split in tcp_fragment()
When we fragment an skb that has already been sent, we need to update
the tx.in_flight for the first skb in the resulting pair ("buff").

Because we were not updating the tx.in_flight, the tx.in_flight value
was inconsistent with the pcount of the "buff" skb (tx.in_flight would
be too high). That meant that if the "buff" skb was lost, then
bbr2_inflight_hi_from_lost_skb() would calculate an inflight_hi value
that is too high. This could result in longer queues and higher packet
loss.

Packetdrill testing verified that without this commit, when the second
half of an skb is SACKed and then later the first half of that skb is
marked lost, the calculated inflight_hi was incorrect.

Effort: net-tcp_bbr
Origin-9xx-SHA1: 385f1ddc610798fab2837f9f372857438b25f874
Origin-9xx-SHA1: a0eb099690af net-tcp_bbr: v2: fix tcp_fragment() tx.in_flight recomputation [prod feb 8 2021; use as a fixup]
Origin-9xx-SHA1: 885503228153ff0c9114e net-tcp_bbr: v2: introduce tcp_skb_tx_in_flight_is_suspicious() helper for warnings
Change-Id: I617f8cab4e9be7a0b8e8d30b047bf8645393354d
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:07:00 +01:00
Yousuk Seung
f1bd4e759f net-tcp: add new ca opts flag TCP_CONG_WANTS_CE_EVENTS
Add a new ca opts flag TCP_CONG_WANTS_CE_EVENTS that allows a
congestion control module to receive CE events.

Currently congestion control modules have to set the TCP_CONG_NEEDS_ECN
bit in opts flag to receive CE events but this may incur changes in ECN
behavior elsewhere. This patch adds a new bit TCP_CONG_WANTS_CE_EVENTS
that allows congestion control modules to receive CE events
independently of TCP_CONG_NEEDS_ECN.

Effort: net-tcp
Origin-9xx-SHA1: 9f7e14716cde760bc6c67ef8ef7e1ee48501d95b
Change-Id: I2255506985242f376d910c6fd37daabaf4744f24
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:06:57 +01:00
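
A small model of the decoupling, with illustrative flag values (the
real bit definitions are kernel-internal): one predicate answers "does
this CC module want CE events?", the other answers "does it need full
ECN negotiation?", so setting only the new bit delivers CE events
without touching ECN behavior elsewhere:

  #include <stdbool.h>
  #include <stdio.h>

  /* Illustrative values only. */
  #define TCP_CONG_NEEDS_ECN       0x2
  #define TCP_CONG_WANTS_CE_EVENTS 0x4

  struct ca_ops_model { unsigned int flags; };

  static bool ca_wants_ce_events(const struct ca_ops_model *ops)
  {
      /* either flag is enough to receive CE events */
      return ops->flags & (TCP_CONG_NEEDS_ECN | TCP_CONG_WANTS_CE_EVENTS);
  }

  static bool ca_needs_ecn(const struct ca_ops_model *ops)
  {
      /* only the original flag changes ECN negotiation */
      return ops->flags & TCP_CONG_NEEDS_ECN;
  }

  int main(void)
  {
      struct ca_ops_model cc = { .flags = TCP_CONG_WANTS_CE_EVENTS };

      printf("wants CE events: %d, needs ECN: %d\n",
             ca_wants_ce_events(&cc), ca_needs_ecn(&cc));
      return 0;
  }
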
Neal Cardwell
54113b9ad5 net-tcp: add fast_ack_mode=1: skip rwin check in tcp_fast_ack_mode__tcp_ack_snd_check()
Add logic for an experimental TCP connection behavior, enabled with
tp->fast_ack_mode = 1, which disables checking the receive window
before sending an ack in __tcp_ack_snd_check(). If this behavior is
enabled, the data receiver sends an ACK if the amount of data is >
RCV.MSS.

Change-Id: Iaa0a0fd7108221f883137a79d5bfa724f1b096d4
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:06:47 +01:00
Jianfeng Wang
96c42cd35d net-tcp_bbr: v2: inform CC module of losses repaired by TLP probe
Before this commit, when there is a packet loss that creates a sequence
hole that is filled by a TLP loss probe, then tcp_process_tlp_ack()
only informs the congestion control (CC) module via a back-to-back entry
and exit of CWR. But some congestion control modules (e.g. BBR) do not
respond to CWR events.

This commit adds a new CA event with which the core TCP stack notifies
the CC module when a loss is repaired by a TLP. This will allow CC
modules that do not use the CWR mechanism to have a custom handler for
such TLP recoveries.

Effort: net-tcp_bbr
Change-Id: Ieba72332b401b329bff5a641d2b2043a3fb8f632
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:06:40 +01:00
Neal Cardwell
4a3e7b65cd net-tcp_bbr: v2: introduce is_acking_tlp_retrans_seq into rate_sample
Introduce is_acking_tlp_retrans_seq into rate_sample. This bool will
export to the CC module the knowledge of whether the current ACK
matched a TLP retransmit.

Note that when this bool is true, we cannot yet tell (in general) whether
this ACK is for the original or the TLP retransmit.

Effort: net-tcp_bbr
Change-Id: I2e6494332167e75efcbdc99bd5c119034e9c39b4
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:06:33 +01:00
David Morley
c57053bdc6 tcp: introduce per-route feature RTAX_FEATURE_ECN_LOW
Define and implement a new per-route feature, RTAX_FEATURE_ECN_LOW.

This feature indicates that the given destination network is a
low-latency ECN environment, meaning both that ECN CE marks are
applied by the network using a low-latency marking threshold and also
that TCP endpoints provide precise per-data-segment ECN feedback in
ACKs (where the ACK ECE flag echoes the received CE status of all
newly-acknowledged data segments). This feature indication can be used
by congestion control algorithms to decide how to interpret ECN
signals over the given destination network.

This feature is appropriate for datacenter-style ECN marking, such as
the ECN marking approach expected by DCTCP or BBR congestion control
modules.

Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Tested-by: David Morley <morleyd@google.com>
Change-Id: I6bc06e9c6cb426fbae7243fc71c9a8c18175f5d3
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:06:30 +01:00
Neal Cardwell
62a04f2316 tcp: export TCPI_OPT_ECN_LOW in tcp_info tcpi_options field
Analogous to other important ECN information, export TCPI_OPT_ECN_LOW
in tcp_info tcpi_options field.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Change-Id: I08d8d8c7e8780e6e37df54038ee50301ac5a0320
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-12-18 15:06:07 +01:00
Sultan Alsawaf
3764b4f297 mm: Omit RCU read lock in list_lru_count_one() when RCU isn't needed
The RCU read lock isn't necessary in list_lru_count_one() when the
condition that requires RCU (CONFIG_MEMCG && !CONFIG_SLOB) isn't met.
The highly-frequent RCU lock and unlock adds measurable overhead to the
shrink_slab() path when it isn't needed. As such, we can simply omit the
RCU read lock in this case to improve performance.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: priiii1808 <priyanshusinghal0818@gmail.com>
2024-12-18 15:05:51 +01:00
Al Viro
436eb37b50 d_add_ci(): make sure we don't miss d_lookup_done()
All callers of d_alloc_parallel() must make sure that resulting
in-lookup dentry (if any) will encounter __d_lookup_done() before
the final dput().  d_add_ci() might end up creating in-lookup
dentries; they are fed to d_splice_alias(), which will normally
make sure they meet __d_lookup_done().  However, it is possible
to end up with d_splice_alias() failing with ERR_PTR(-ELOOP)
without having done so.  It takes a corrupted ntfs or case-insensitive
xfs image, but neither should end up with memory corruption...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-12-18 15:05:46 +01:00
Sebastian Andrzej Siewior
16ecd487af sched/rt: Don't try push tasks if there are none.
I have a RT task X at a high priority and cyclictest on each CPU with
lower priority than X's. If X is active and each CPU wakes their own
cyclictest thread then it ends in a longer rto_push storm.
A random CPU determines via balance_rt() that the CPU on which X is
running needs to push tasks. X has the highest priority, cyclictest is
next in line so there is nothing that can be done since the task with
the higher priority is not touched.

tell_cpu_to_push() increments rto_loop_next and schedules
rto_push_irq_work_func() on X's CPU. The other CPUs also increment the
loop counter and do the same. Once rto_push_irq_work_func() is active it
does nothing because it has _no_ pushable tasks on its runqueue. Then
checks rto_next_cpu() and decides to queue irq_work on the local CPU
because another CPU requested a push by incrementing the counter.

I have traces where ~30 CPUs request this ~3 times each before it
finally ends. This greatly increases X's runtime while X isn't making
much progress.

Teach rto_next_cpu() to only return CPUs which also have tasks on their
runqueue which can be pushed away. This does not reduce the
tell_cpu_to_push() invocations (rto_loop_next counter increments) but
reduces the amount of issued rto_push_irq_work_func() if nothing can be
done. As a result, the overloaded CPU is blocked less often.

There are still cases where the "same job" is repeated several times
(for instance the current CPU needs to resched but didn't yet because
the irq-work is repeated a few times and so the old task remains on the
CPU) but the majority of requests end in tell_cpu_to_push() before an IPI
is issued.

Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20230801152648._y603AS_@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2024-12-18 15:05:40 +01:00
Yaroslav Furman
9c41dd9383 rcu: Fix a performance regression.
Commit "rcu: Create RCU-specific workqueues with rescuers" switched RCU
to using local workqueses and removed power efficiency flag from them.

This caused a performance regression that can be observed in Geekbench 5
after enabling CONFIG_WQ_POWER_EFFICIENT_DEFAULT: score went down from
760/2500 to 620/2300 (single/multi core respectively).

Add WQ_POWER_EFFICIENT flag to avoid this regression.

Change-Id: I2c4f41faa55548f9e81a1c0cbe10703948062d89
2024-12-18 15:05:33 +01:00
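
The substance of the fix is just the extra flag at allocation time; a
kernel-side sketch, assuming the mainline rcu_gp/rcu_par_gp workqueue
names also apply to this tree:

  rcu_gp_wq = alloc_workqueue("rcu_gp",
                              WQ_MEM_RECLAIM | WQ_POWER_EFFICIENT, 0);
  rcu_par_gp_wq = alloc_workqueue("rcu_par_gp",
                                  WQ_MEM_RECLAIM | WQ_POWER_EFFICIENT, 0);
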
Carlos Llamas
5380adeb80 BACKPORT: FROMLIST: binder: fix freeze UAF in binder_release_work()
When a binder reference is cleaned up, any freeze work queued in the
associated process should also be removed. Otherwise, the reference is
freed while its ref->freeze.work is still queued in proc->work leading
to a use-after-free issue as shown by the following KASAN report:

  ==================================================================
  BUG: KASAN: slab-use-after-free in binder_release_work+0x398/0x3d0
  Read of size 8 at addr ffff31600ee91488 by task kworker/5:1/211

  CPU: 5 UID: 0 PID: 211 Comm: kworker/5:1 Not tainted 6.11.0-rc7-00382-gfc6c92196396 #22
  Hardware name: linux,dummy-virt (DT)
  Workqueue: events binder_deferred_func
  Call trace:
   binder_release_work+0x398/0x3d0
   binder_deferred_func+0xb60/0x109c
   process_one_work+0x51c/0xbd4
   worker_thread+0x608/0xee8

  Allocated by task 703:
   __kmalloc_cache_noprof+0x130/0x280
   binder_thread_write+0xdb4/0x42a0
   binder_ioctl+0x18f0/0x25ac
   __arm64_sys_ioctl+0x124/0x190
   invoke_syscall+0x6c/0x254

  Freed by task 211:
   kfree+0xc4/0x230
   binder_deferred_func+0xae8/0x109c
   process_one_work+0x51c/0xbd4
   worker_thread+0x608/0xee8
  ==================================================================

This commit fixes the issue by ensuring any queued freeze work is removed
when cleaning up a binder reference.

Fixes: d579b04a52a1 ("binder: frozen notification")
Cc: stable@vger.kernel.org
Signed-off-by: Carlos Llamas <cmllamas@google.com>

Bug: 366003708
Link: https://lore.kernel.org/all/20240924184401.76043-4-cmllamas@google.com/
Change-Id: Icc40e7dd6157981f4adbea7243e55be118552321
[cmllamas: drop BINDER_STAT_FREEZE as it's not supported here]
Signed-off-by: Carlos Llamas <cmllamas@google.com>
2024-12-18 15:03:54 +01:00
Carlos Llamas
1cf14664be BACKPORT: FROMGIT: binder: fix memleak of proc->delivered_freeze
If a freeze notification is cleared with BC_CLEAR_FREEZE_NOTIFICATION
before calling binder_freeze_notification_done(), then it is detached
from its reference (e.g. ref->freeze) but the work remains queued in
proc->delivered_freeze. This leads to a memory leak when the process
exits as any pending entries in proc->delivered_freeze are not freed:

  unreferenced object 0xffff38e8cfa36180 (size 64):
    comm "binder-util", pid 655, jiffies 4294936641
    hex dump (first 32 bytes):
      b8 e9 9e c8 e8 38 ff ff b8 e9 9e c8 e8 38 ff ff  .....8.......8..
      0b 00 00 00 00 00 00 00 3c 1f 4b 00 00 00 00 00  ........<.K.....
    backtrace (crc 95983b32):
      [<000000000d0582cf>] kmemleak_alloc+0x34/0x40
      [<000000009c99a513>] __kmalloc_cache_noprof+0x208/0x280
      [<00000000313b1704>] binder_thread_write+0xdec/0x439c
      [<000000000cbd33bb>] binder_ioctl+0x1b68/0x22cc
      [<000000002bbedeeb>] __arm64_sys_ioctl+0x124/0x190
      [<00000000b439adee>] invoke_syscall+0x6c/0x254
      [<00000000173558fc>] el0_svc_common.constprop.0+0xac/0x230
      [<0000000084f72311>] do_el0_svc+0x40/0x58
      [<000000008b872457>] el0_svc+0x38/0x78
      [<00000000ee778653>] el0t_64_sync_handler+0x120/0x12c
      [<00000000a8ec61bf>] el0t_64_sync+0x190/0x194

This patch fixes the leak by ensuring that any pending entries in
proc->delivered_freeze are freed during binder_deferred_release().

Fixes: d579b04a52a1 ("binder: frozen notification")
Cc: stable@vger.kernel.org
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Todd Kjos <tkjos@google.com>
Link: https://lore.kernel.org/r/20240926233632.821189-8-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Bug: 366003708
(cherry picked from commit 1db76ec2b4b206ff943e292a0b55e68ff3443598
 git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
 char-misc-next)
Change-Id: Iafdec3421c521b4b591b94455deba7ee5102c8ca
[cmllamas: drop BINDER_STAT_FREEZE and use binder_proc_ext_entry()]
Signed-off-by: Carlos Llamas <cmllamas@google.com>
2024-12-18 15:03:35 +01:00
Carlos Llamas
4554485eb2 BACKPORT: FROMGIT: binder: add delivered_freeze to debugfs output
Add the pending proc->delivered_freeze work to the debugfs output. This
information was omitted in the original implementation of the freeze
notification and can be valuable for debugging issues.

Fixes: d579b04a52a1 ("binder: frozen notification")
Cc: stable@vger.kernel.org
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Acked-by: Todd Kjos <tkjos@google.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://lore.kernel.org/r/20240926233632.821189-9-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Bug: 366003708
(cherry picked from commit cb2aeb2ec25884133110ffe5a67ff3cf7dee5ceb
 git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
 char-misc-next)
Change-Id: Ifc9a22b52e38c35af661732486fa1f154adb34de
[cmllamas: fix KMI break with binder_proc_ext_entry()]
Signed-off-by: Carlos Llamas <cmllamas@google.com>
2024-12-18 15:03:31 +01:00
Sultan Alsawaf
73ad55b1e6 sched/rt: Change default SCHED_RR timeslice from 100 ms to 1 jiffy
For us, it's most helpful to have the round-robin timeslice as low as is
allowed by the scheduler to reduce latency. Since it's limited by the
scheduler tick rate, just set the default to 1 jiffy, which is the
lowest possible value.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: I6c9f6bb5bbadf363efb719d3e30b0b073654d688
2024-12-18 15:03:27 +01:00
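
The effective quantum can be checked from userspace with
sched_rr_get_interval(); on a kernel with this change and, say, HZ=250
(an assumption for the example), the reported interval drops from
100 ms to a single tick (~4 ms):

  #include <sched.h>
  #include <stdio.h>
  #include <time.h>

  int main(void)
  {
      struct sched_param sp = { .sched_priority = 1 };
      struct timespec ts;

      /* The RR quantum is only reported for a SCHED_RR task; switching
       * policy needs CAP_SYS_NICE, so run this as root. */
      if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
          perror("sched_setscheduler");
          return 1;
      }
      if (sched_rr_get_interval(0, &ts) == 0)
          printf("SCHED_RR timeslice: %ld.%09ld s\n",
                 (long)ts.tv_sec, ts.tv_nsec);
      return 0;
  }
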
Vincent Guittot
0f92567070 sched/fair: Make sure to try to detach at least one movable task
During load balance, we try at most env->loop_max times to move a task.
But it can happen that the loop_max LRU tasks (i.e. the tail of
the cfs_tasks list) can't be moved to dst_cpu because of affinity.
In this case, loop over the list until we find at least one that can.

The maximum number of detached tasks remains the same as before.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220825122726.20819-2-vincent.guittot@linaro.org
2024-12-18 15:03:21 +01:00
Miguel de Dios
8b98e08a04 sched: reduce softirq conflicts with RT
This is a forward port of pa/890483 with modifications from the original
patch due to changes in sched/softirq.c which applies the same logic.

We're finding audio glitches caused by audio-producing RT tasks
that are either interrupted to handle softirq's or that are
scheduled onto cpu's that are handling softirq's.
In a previous patch, we attempted to catch many cases of the
latter problem, but it's clear that we are still losing
significant numbers of races in some apps.

This patch attempts to address the following problem:
   It attempts to reduce the most common windows in which
   we lose the race between scheduling an RT task on a remote
   core and starting to handle softirq's on that core.
   We still lose some races, but we lose significantly fewer.
   (And we don't want to introduce any heavyweight forms
   of synchronization on these paths.)

Bug: 64912585
Bug: 136771796
Bug: 144961676
Change-Id: Ida89a903be0f1965552dd0e84e67ef1d3158c7d8
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Signed-off-by: Jimmy Shiu <jimmyshiu@google.com>
(cherry picked from commit 3c2569f21be6b7c457b389821b098d30bd03b84c)
(cherry picked from commit f9cf84966a315169f3f505993b9e15b6c53f2bbb)
(cherry picked from commit e6edaa29efee9e2f2451ab50030d8cde19ab227b)
(cherry picked from commit db52e23658eaf304ba2333f04d910313c0c21e05)
(cherry picked from commit aa2d4d07929d1c06c6b99ed8718394cf8e471d21)
(cherry picked from commit cefdb96edccdc3f788275d1bcef89c8dfeeb8b11)
(cherry picked from commit 5a45a0208069ac5704713b35a0858fcd5997f97a)
(cherry picked from commit 86ac137eb7399547c6cd20d60063d244091e920f)
2024-12-18 15:03:18 +01:00
Samuel Pascua
04ff6cbc91 drivers: soc: acpm: Prevent optimization of 'acpm_initdata'
This addresses more optimization issues when compiling with Clang: panics happen when the device goes into standby, with the following report.

<6>[ 1470.900859]  [0:  Binder:4157_2: 8735] EXYNOS-PM:: MIF down. cur_count: 5, acc_count: 5
<6>[ 1470.900859]  [0:  Binder:4157_2: 8735] EXYNOS-PM:: MIF_UP history:
<6>[ 1470.900859]  [0:  Binder:4157_2: 8735] EXYNOS-PM: mifuser: 0x540000, time: 5:35:40, latency: 1955[usec]
<6>[ 1470.900859]  [0:  Binder:4157_2: 8735] EXYNOS-PM: mifuser: 0x400000, time: 5:35:40, latency: 1956[usec]
<6>[ 1470.900859]  [0:  Binder:4157_2: 8735] EXYNOS-PM: mifuser: 0x100000, time: 5:35:41, latency: 1954[usec]
<6>[ 1470.900859]  [0:  Binder:4157_2: 8735] EXYNOS-PM: mifuser: 0x400000, time: 5:35:41, latency: 1955[usec]
<6>[ 1470.900859]  [0:  Binder:4157_2: 8735] EXYNOS-PM: mifuser: 0x100000, time: 5:35:41, latency: 1955[usec]
<0>[ 1470.900859]  [0:  Binder:4157_2: 8735] Unable to handle kernel paging request at virtual address ffffff800b346f9c
<2>[ 1470.900859]  [0:  Binder:4157_2: 8735] sec_debug_set_extra_info_fault = KERN / 0xffffff800b346f9c
<1>[ 1470.900859]  [0:  Binder:4157_2: 8735] Mem abort info:
<1>[ 1470.900859]  [0:  Binder:4157_2: 8735]   Exception class = DABT (current EL), IL = 32 bits
<1>[ 1470.900859]  [0:  Binder:4157_2: 8735]   SET = 0, FnV = 0
<1>[ 1470.900859]  [0:  Binder:4157_2: 8735]   EA = 0, S1PTW = 0
<1>[ 1470.900859]  [0:  Binder:4157_2: 8735] Data abort info:
<1>[ 1470.900859]  [0:  Binder:4157_2: 8735]   ISV = 0, ISS = 0x00000061
<1>[ 1470.900859]  [0:  Binder:4157_2: 8735]   CM = 0, WnR = 1
<1>[ 1470.900859]  [0:  Binder:4157_2: 8735] swapper pgtable: 4k pages, 39-bit VAs, pgd = ffffff800a66a000
<1>[ 1470.900859]  [0:  Binder:4157_2: 8735] [ffffff800b346f9c] *pgd=000000097cdfe003, *pud=000000097cdfe003, *pmd=00000009740b7003, *pte=00e800000203f707
<0>[ 1470.900859]  [0:  Binder:4157_2: 8735] Internal error: Oops: 96000061 [#1] PREEMPT SMP
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] Modules linked in:
<0>[ 1470.900859]  [0:  Binder:4157_2: 8735] Process Binder:4157_2 (pid: 8735, stack limit = 0xffffff8039708000)
<0>[ 1470.900859]  [0:  Binder:4157_2: 8735] debug-snapshot: core register saved(CPU:0)
<0>[ 1470.900859]  [0:  Binder:4157_2: 8735] L2ECTLR_EL1: 0000000000000007
<0>[ 1470.900859]  [0:  Binder:4157_2: 8735] L2ECTLR_EL1 valid_bit(30) is NOT set (0x0)
<0>[ 1470.900859]  [0:  Binder:4157_2: 8735] CPUMERRSR: 0000000008000001, L2MERRSR: 0000000010200c00
<0>[ 1470.900859]  [0:  Binder:4157_2: 8735] CPUMERRSR valid_bit(31) is NOT set (0x0)
<0>[ 1470.900859]  [0:  Binder:4157_2: 8735] L2MERRSR valid_bit(31) is NOT set (0x0)
<0>[ 1470.900859]  [0:  Binder:4157_2: 8735] debug-snapshot: context saved(CPU:0)
<6>[ 1470.900859]  [0:  Binder:4157_2: 8735] debug-snapshot: item - log_kevents is disabled
<6>[ 1470.900859]  [0:  Binder:4157_2: 8735] TIF_FOREIGN_FPSTATE: 1, FP/SIMD depth 0, cpu: 0
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] CPU: 0 PID: 8735 Comm: Binder:4157_2 Not tainted 4.14.113 - Fresh Core-user #1
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] Hardware name: Samsung A50 LTN OPEN rev04 board based on Exynos9610 (DT)
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] task: ffffffc0466d6000 task.stack: ffffff8039708000
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] PC is at acpm_get_inform+0x90/0x100
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] LR is at acpm_get_inform+0x7c/0x100
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] pc : [<ffffff8008505cd4>] lr : [<ffffff8008505cc0>] pstate: 604001c5
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] sp : ffffff803970bac0
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x29: ffffff803970bac0 x28: ffffffc0466d6000
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x27: ffffff8008e44b64 x26: ffffff8008e44b3e
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x25: ffffff8009e5f210 x24: 0000000010624dd3
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x23: 0000000000000029 x22: 0000000000000018
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x21: ffffff8009e2c000 x20: ffffff8008ef6785
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x19: ffffff8008ef674c x18: 00000000000000a0
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x17: ffffff8009b3023c x16: 0000000000000001
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x15: ffffff8008c8a964 x14: 202c303030303031
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x13: 7830203a72657375 x12: 0000000000000000
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x11: 0000000000000000 x10: ffffffffffffffff
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x9 : ffffff800b346f00 x8 : ffffff800b346f00
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x7 : 203a79636e657461 x6 : ffffff80f615273c
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x5 : 000000000000221f x4 : 000000000000000c
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x3 : 000000000000000a x2 : 0000000000000000
<4>[ 1470.900859]  [0:  Binder:4157_2: 8735] x1 : 00000000000001c0 x0 : 0000000000000041

Similar solution to d855e6f: make the structs volatile to prevent the optimization.

Signed-off-by: John Vincent <git@tensevntysevn.cf>
Signed-off-by: Samuel Pascua <sgpascua@ngcp.ph>
2024-12-18 15:02:50 +01:00
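
What the volatile qualifier buys, shown with a generic self-contained
example rather than the actual acpm structures: every access to a
volatile object must be performed as written, so the compiler cannot
cache the value in a register or discard "redundant" loads and stores
the way it apparently did for the acpm data:

  #include <stdio.h>

  struct shared_region {
      unsigned int status;
      unsigned int data;
  };

  /* Pretend this region is also written by firmware or another agent.
   * Without volatile, the compiler may hoist the load out of the loop
   * and spin on a stale register copy. */
  static volatile struct shared_region region = { .status = 0, .data = 42 };

  static unsigned int wait_for_ready(void)
  {
      while (region.status == 0)
          ;                       /* each iteration re-reads memory */
      return region.data;
  }

  int main(void)
  {
      region.status = 1;          /* stand-in for the external writer */
      printf("data = %u\n", wait_for_ready());
      return 0;
  }
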
xxmustafacooTR
5843cd8543 drivers: soc: samsung: acpm: disable lto 2024-12-18 15:02:45 +01:00
xxmustafacooTR
9ca91577b1 fvmap: optimize voltages 2024-12-18 15:02:40 +01:00
Redick Lin
70c79241d2 soc: samsung: acpm: extend the timeout for acpm ipc retry
Extend it from 15ms to 200ms

Bug: 172883429
Change-Id: I39e8e860dfeaa4d1d3b702f06dca51dd01bc8367
Signed-off-by: Redick Lin <redicklin@google.com>
2024-12-18 15:02:36 +01:00
Nahuel Gómez
a13ddad5af vboot_dlkm: drop duplicate sm5451 module
Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-12-18 15:02:19 +01:00
Nahuel Gómez
375f7e1318 ARM64: dts/s5e8825: GPU undervolt to 790mV
This is a base value only though, I'm not sure if it actually works.

Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-12-18 15:01:20 +01:00
Zhaoyang Huang
f11d790f3a psi: fix possible trigger missing in the window
When a new threshold breaching stall happens after a psi event was
generated and within the window duration, the new event is not
generated because the events are rate-limited to one per window. If
after that no new stall is recorded then the event will not be
generated even after rate-limiting duration has passed. This is
happening because with no new stall, window_update will not be called
even though threshold was previously breached. To fix this, record
threshold breaching occurrence and generate the event once window
duration is passed.

Suggested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/1643093818-19835-1-git-send-email-huangzhaoyang@gmail.com
(cherry picked from commit e6df4ead85d9da1b07dd40bd4c6d2182f3e210c4)
(cherry picked from commit 007db79713251fd67cc5ce7142942772f6074aca)
(cherry picked from commit 129c4324ce9ea8be9c528629bca72e1e1f80d3fb)
2024-12-18 15:01:14 +01:00
Hailong Liu
e48c85b067 psi: Fix trigger being fired unexpectedly at initial
When a trigger being created, its win.start_value and win.start_time are
reset to zero. If group->total[PSI_POLL][t->state] has accumulated before,
this trigger will be fired unexpectedly in the next period, even if its
growth time does not reach its threshold.

So set the window of the new trigger to the current state value.

Signed-off-by: Hailong Liu <liuhailong@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/1648789811-3788971-1-git-send-email-liuhailong@linux.alibaba.com
(cherry picked from commit 915a087e4c47334a2f7ba2a4092c4bade0873769)
(cherry picked from commit 168af79cd0d6750ddf6c775c79833059da606c34)
(cherry picked from commit 9b3b0d7ba99e41a1159a0b9c58ff138bb51d52a1)
2024-12-18 15:01:10 +01:00
Chengming Zhou
09558667e8 sched/psi: report zeroes for CPU full at the system level
Martin found it confusing when looking at the /proc/pressure/cpu output,
and found no hint about the CPU "full" line in the psi Documentation.

% cat /proc/pressure/cpu
some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489
full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277

The PSI_CPU_FULL state is introduced by commit e7fcd7622823
("psi: Add PSI_CPU_FULL state"), which mainly for cgroup level,
but also counted at the system level as a side effect.

Naturally, the FULL state doesn't exist for the CPU resource at
the system level. These "full" numbers can come from CPU idle
schedule latency. For example, t1 is the time when task wakeup
on an idle CPU, t2 is the time when CPU pick and switch to it.
The delta of (t2 - t1) will be in CPU_FULL state.

Another case where all processes can be stalled is when all cgroups
have been throttled at the same time, which is unlikely to happen.

Anyway, CPU_FULL metric is meaningless and confusing at the
system level. So this patch will report zeroes for CPU full
at the system level, and update psi Documentation accordingly.

Fixes: e7fcd7622823 ("psi: Add PSI_CPU_FULL state")
Reported-by: Martin Steigerwald <Martin.Steigerwald@proact.de>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220408121914.82855-1-zhouchengming@bytedance.com
(cherry picked from commit 890d550d7dbac7a31ecaa78732aa22be282bb6b8)
(cherry picked from commit f5187de2b75d019739caec97e8e7886a27e8554c)
(cherry picked from commit 669718aaede6df28e37f6a11c0e257d18f050b1a)
2024-12-18 15:01:07 +01:00
John Stultz
91839ef7d5 ANDROID: sched: Fix off-by-one with cpupri MAX_RT_PRIO evaluation
This patch addresses an issue seen where SCHED_FIFO prio 99
tasks were being woken up on a cpu where a long-running softirq
was executing, and the RT task was not being migrated, causing
long (10+ms wakeup latencies).

Prior to upstream commit 934fc3314b39 ("sched/cpupri: Remap
CPUPRI_NORMAL to MAX_RT_PRIO-1") the task->prio -> cpupri
mapping was a little awkward.

For RT tasks, its calculated as:
  cpupri = MAX_RT_PRIO - prio + 1;

See:
https://android.googlesource.com/kernel/common/+/refs/heads/android13-5.10/kernel/sched/cpupri.c#39

This is added on top of the also awkward detail that the
task->prio is inverted (RT prio 99 -> 0), which means the cpupri
mapping for RT tasks goes from 2 to 101. This makes it easy to
evaluate the cpupri incorrectly.

Which, it turns out, happened in commit 3adfd8e344ac ("ANDROID:
sched: avoid placing RT threads on cores handling softirqs"):
  3adfd8e344%5E%21/

With the lines:
  int task_pri = convert_prio(p->prio);
  bool drop_nopreempts = task_pri <= MAX_RT_PRIO;

Where the added logic to decide to migrate a rt task off of a
cpu depended on this drop_nopreempts being true.

This works properly for rt tasks from prio 99 to 1, but for the
case of task->prio == 0 (userland rt prio 99 tasks) it breaks,
as the cpupri will be MAX_RT_PRIO - 0 + 1, which then gets
checked as <= MAX_RT_PRIO.

This prevents the cpu from being dropped from the scheduling
set and prevents the rt user prio 99 task from migrating, which
results in high priority rt tasks being left on cpus where long
running softirqs are executing, causing long latencies.

This patch fixes the off-by-one by changing the evaluation
to MAX_RT_PRIO + 1, so that user-prio 99 tasks will also be
migrated if appropriate.

Luckily this odd cpupri mapping has been fixed upstream, making
this patch no longer necessary in 5.15 and newer kernels.

Fixes: 3adfd8e344ac ("ANDROID: sched: avoid placing RT threads on cores handling softirqs")
Signed-off-by: John Stultz <jstultz@google.com>
Change-Id: Ia2db7cd461eb4c90f5850b791de1ae95582f7530
(cherry picked from commit bec3b2f002cd13d44e32710315de91e636ecef84)
(cherry picked from commit e8e441d4a2f6b6a316d229f8ff1ebe0f40099817)
(cherry picked from commit 1c1f38f1e1fa82e09e0e3f6f941ca81127682ec3)
2024-12-18 15:01:04 +01:00
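
The off-by-one, worked through in a tiny standalone model of the old
mapping quoted above (MAX_RT_PRIO is 100): a userland RT prio-99 task
has task->prio == 0, so cpupri = 100 - 0 + 1 = 101, and the old check
"task_pri <= MAX_RT_PRIO" rejects exactly that highest-priority case,
while "<= MAX_RT_PRIO + 1" accepts it:

  #include <stdbool.h>
  #include <stdio.h>

  #define MAX_RT_PRIO 100

  /* Pre-5.15-style mapping from task->prio (0..99, 0 = user RT prio 99)
   * to a cpupri value, as quoted in the commit message above. */
  static int rt_task_cpupri(int task_prio)
  {
      return MAX_RT_PRIO - task_prio + 1;
  }

  int main(void)
  {
      for (int prio = 99; prio >= 0; prio--) {
          int cpupri = rt_task_cpupri(prio);
          bool old_check = cpupri <= MAX_RT_PRIO;       /* buggy */
          bool new_check = cpupri <= MAX_RT_PRIO + 1;   /* fixed */

          if (old_check != new_check)
              printf("task->prio=%d (user RT prio %d): cpupri=%d, "
                     "old=%d new=%d\n",
                     prio, 99 - prio, cpupri, old_check, new_check);
      }
      return 0;
  }
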
Hui Su
f93cf8f6a1 sched: Use task_current() instead of 'rq->curr == p'
Use the task_current() function where appropriate.

No functional change.

Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201030173223.GA52339@rlk
2024-12-18 15:00:48 +01:00
b6b17fef92 Revert "defconfig: a53x_defconfig: Nuke KPROBE"
This reverts commit d30b19e276.
2024-12-18 14:59:29 +01:00
d30b19e276 defconfig: a53x_defconfig: Nuke KPROBE 2024-12-18 14:14:56 +01:00
Ksawlii
d1a6ca7818 defconfigs: a53x*: Regenerated with newer clang (19) and Linux 5.10.231 2024-12-18 12:27:51 +01:00
Josh Don
bb74b7f456 sched: Allow newidle balancing to bail out of load_balance
While doing newidle load balancing, it is possible for new tasks to
arrive, such as with pending wakeups. newidle_balance() already accounts
for this by exiting the sched_domain load_balance() iteration if it
detects these cases. This is very important for minimizing wakeup
latency.

However, if we are already in load_balance(), we may stay there for a
while before returning back to newidle_balance(). This is most
exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
very straightforward workaround to this is to adjust should_we_balance()
to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
detected.

This was tested with the following reproduction:
- two threads that take turns sleeping and waking each other up are
  affined to two cores
- a large number of threads with 100% utilization are pinned to all
  other cores

Without this patch, wakeup latency was ~120us for the pair of threads,
almost entirely spent in load_balance(). With this patch, wakeup latency
is ~6us.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220609025515.2086253-1-joshdon@google.com
2024-12-18 12:21:46 +01:00
Vincent Guittot
180d869ae6 sched/pelt: Relax the sync of load_sum with load_avg
Similarly to util_avg and util_sum, don't sync load_sum with the low
bound of load_avg but only ensure that load_sum stays in the correct range.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lkml.kernel.org/r/20220111134659.24961-5-vincent.guittot@linaro.org
2024-12-18 12:21:43 +01:00
Vincent Guittot
9340880407 sched/pelt: Relax the sync of runnable_sum with runnable_avg
Similarly to util_avg and util_sum, don't sync runnable_sum with the low
bound of runnable_avg but only ensure that runnable_sum stays in the
correct range.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lkml.kernel.org/r/20220111134659.24961-4-vincent.guittot@linaro.org
2024-12-18 12:21:39 +01:00
Vincent Guittot
90b4b30575 sched/pelt: Continue to relax the sync of util_sum with util_avg
Rick reported performance regressions in bugzilla because of cpu frequency
being lower than before:
    https://bugzilla.kernel.org/show_bug.cgi?id=215045

He bisected the problem to:
commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")

This commit forces util_sum to be synced with the new util_avg after
removing the contribution of a task and before the next periodic sync. By
doing so util_sum is rounded to its lower bound and might lost up to
LOAD_AVG_MAX-1 of accumulated contribution which has not yet been
reflected in util_avg.

update_tg_cfs_util() is not the only place where we round util_sum and
lose some accumulated contributions that are not already reflected in
util_avg. Modify update_tg_cfs_util() and detach_entity_load_avg() to not
sync util_sum with the new util_avg. Instead of always setting util_sum to
the low bound of util_avg, which can significantly lower the utilization,
we propagate the difference. In addition, we also check that cfs's util_sum
always stays above the lower bound for a given util_avg as it has been
observed that sched_entity's util_sum is sometimes above the cfs one.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lkml.kernel.org/r/20220111134659.24961-3-vincent.guittot@linaro.org
2024-12-18 12:21:35 +01:00
Mathias Krause
b59eaa682e sched/fair: Prevent dead task groups from regaining cfs_rq's
Kevin is reporting crashes which point to a use-after-free of a cfs_rq
in update_blocked_averages(). Initial debugging revealed that we have
live cfs_rq's (on_list=1) in an about to be kfree()'d task group in
free_fair_sched_group(). However, it was unclear how that can happen.

His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object
which makes it incidentally overlap with SLUB's freelist pointer.
That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
mangling, leads to a reliable access violation in form of a #GP which
made the UAF fail fast.

Michal seems to have run into the same issue[1]. He already correctly
diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
cfs_rq's to list on unthrottle") is causing the preconditions for the
UAF to happen by re-adding cfs_rq's also to task groups that have no
more running tasks, i.e. also to dead ones. His analysis, however,
misses the real root cause and it cannot be seen from the crash
backtrace only, as the real offender is tg_unthrottle_up() getting
called via sched_cfs_period_timer() via the timer interrupt at an
inconvenient time.

When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
task group, it doesn't protect itself from getting interrupted. If the
timer interrupt triggers while we iterate over all CPUs or after
unregister_fair_sched_group() has finished but prior to unlinking the
task group, sched_cfs_period_timer() will execute and walk the list of
task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
dying task group. These will later -- in free_fair_sched_group() -- be
kfree()'ed while still being linked, leading to the fireworks Kevin
and Michal are seeing.

To fix this race, ensure the dying task group gets unlinked first.
However, simply switching the order of unregistering and unlinking the
task group isn't sufficient, as concurrent RCU walkers might still see
it, as can be seen below:

    CPU1:                                      CPU2:
      :                                        timer IRQ:
      :                                          do_sched_cfs_period_timer():
      :                                            :
      :                                            distribute_cfs_runtime():
      :                                              rcu_read_lock();
      :                                              :
      :                                              unthrottle_cfs_rq():
    sched_offline_group():                             :
      :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
      list_del_rcu(&tg->list);                           :
 (1)  :                                                  list_for_each_entry_rcu(child, &parent->children, siblings)
      :                                                    :
 (2)  list_del_rcu(&tg->siblings);                         :
      :                                                    tg_unthrottle_up():
      unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
        :                                                    :
        list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);               :
        :                                                    :
        :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
 (3)    :                                                        list_add_leaf_cfs_rq(cfs_rq);
      :                                                      :
      :                                                    :
      :                                                  :
      :                                                :
      :                                              :
 (4)  :                                              rcu_read_unlock();

CPU 2 walks the task group list in parallel to sched_offline_group(),
specifically, it'll read the soon to be unlinked task group entry at
(1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
all cfs_rq's via list_del_leaf_cfs_rq() in
unregister_fair_sched_group().  Meanwhile CPU 2 will re-add some of
these at (3), which is the cause of the UAF later on.

To prevent this additional race from happening, we need to wait until
walk_tg_tree_from() has finished traversing the task groups, i.e.
after the RCU read critical section ends in (4). Afterwards we're safe
to call unregister_fair_sched_group(), as each new walk won't see the
dying task group any more.

On top of that, we need to wait yet another RCU grace period after
unregister_fair_sched_group() to ensure print_cfs_stats(), which might
run concurrently, always sees valid objects, i.e. not already free'd
ones.

This patch survives Michal's reproducer[2] for 8h+ now, which used to
trigger within minutes before.

  [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
  [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/

Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
[peterz: shuffle code around a bit]
Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-12-18 12:21:30 +01:00
Vincent Guittot
cd10658059 sched/fair: Remove sysctl_sched_migration_cost condition
With a default value of 500us, sysctl_sched_migration_cost is
significantly higher than the cost of load_balance. Remove the
condition and rely on the sd->max_newidle_lb_cost to abort
newidle_balance.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-5-vincent.guittot@linaro.org
2024-12-18 12:21:25 +01:00
Vincent Guittot
3dc3531c78 sched/fair: Wait before decaying max_newidle_lb_cost
Decay max_newidle_lb_cost only when it has not been updated for a while
and ensure to not decay a recently changed value.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-4-vincent.guittot@linaro.org
2024-12-18 12:21:16 +01:00