Commit graph

4502 commits

Author SHA1 Message Date
ztc1997
136bbfd757 block: Add default I/O scheduler option 2024-11-19 17:43:55 +01:00
Paolo Valente
fbbabdb3bc block, bfq: use half slice_idle as a threshold to check short ttime
The value of the I/O plugging (idling) timeout is used also as the
think-time threshold to decide whether a process has a short think
time.  In this respect, a good value of this timeout for rotational
drives is un the order of several ms. Yet, this is often too long a
time interval to be effective as a think-time threshold. This commit
mitigates this problem (by a lot, according to tests), by halving the
threshold.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit b5f74ecacc3139ef873e69acc3aba28083ecc416)
(cherry picked from commit b1511c438e8a5668e6be04ad9107d6695332756c)
(cherry picked from commit 389992d9dc78340676248d0f01c7569b3db950ed)
(cherry picked from commit 49919eface6f4391cda0e77bcaad3e2786cbbab3)
(cherry picked from commit 87b015de51122ea9b5d9e56b846ae945db8444f0)
(cherry picked from commit 6ada34cdc94c89e97926a2d001412ecc027e1392)
(cherry picked from commit 2782bcc2919dd2a0a1d461d36c22338e67bc6327)
2024-11-19 17:43:46 +01:00
Paolo Valente
fe945719eb block, bfq: increase time window for waker detection
Tests on slower machines showed current window to be way too
small. This commit increases it.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit ab1fb47e33dc7754a7593181ffe0742c7105ea9a)
(cherry picked from commit 0d1663f1922c5f6fb3a4b3cc5a3a861c765a3704)
(cherry picked from commit 85d9e1637a38d0cfdeba4e3847f1797dcd18da5d)
(cherry picked from commit 6bd707bb9a60e2bf0e680a271208f6c82a331571)
(cherry picked from commit 43755e08d048ccd6f3b2a3bbd34bea4a71c5bc12)
(cherry picked from commit b1a8cce9e99277ce53da20ab603473ad6c3e95d1)
(cherry picked from commit 74d27133a3261a296ddd98e9ff09d89bfab797bb)
2024-11-19 17:43:43 +01:00
Paolo Valente
7034a03ec0 block, bfq: do not raise non-default weights
BFQ heuristics try to detect interactive I/O, and raise the weight of
the queues containing such an I/O. Yet, if also the user changes the
weight of a queue (i.e., the user changes the ioprio of the process
associated with that queue), then it is most likely better to prevent
BFQ heuristics from silently changing the same weight.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 91b896f65d32610d6d58af02170b15f8d37a7702)
(cherry picked from commit cbbd2f045e60073978fe1b721c0953cd8762ecbb)
(cherry picked from commit 88b650c71f7d0d30ac2fa215a139d7a48d069cd9)
(cherry picked from commit 9a4725f0341c71a9b4f50f2d203f9740029e42e5)
(cherry picked from commit a2c57345ffa5404cefd3d43e2fd4e4492ac7c6e0)
(cherry picked from commit df56458ca85c681d163d879b832f868ed5044c8e)
(cherry picked from commit dfc085aad98db2bcabd2c438fcd722a90303e6cb)
2024-11-19 17:43:40 +01:00
Paolo Valente
4b23f1e69b block, bfq: do not expire a queue when it is the only busy one
This commits preserves I/O-dispatch plugging for a special symmetric
case that may suddenly turn into asymmetric: the case where only one
bfq_queue, say bfqq, is busy. In this case, not expiring bfqq does not
cause any harm to any other queues in terms of service guarantees. In
contrast, it avoids the following unlucky sequence of events: (1) bfqq
is expired, (2) a new queue with a lower weight than bfqq becomes busy
(or more queues), (3) the new queue is served until a new request
arrives for bfqq, (4) when bfqq is finally served, there are so many
requests of the new queue in the drive that the pending requests for
bfqq take a lot of time to be served. In particular, event (2) may
case even already dispatched requests of bfqq to be delayed, inside
the drive. So, to avoid this series of events, the scenario is
preventively declared as asymmetric also if bfqq is the only busy
queues. By doing so, I/O-dispatch plugging is performed for bfqq.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 2391d13ed484df1515f0025458e1f82317823fab)
(cherry picked from commit 79827eb41d8fb0f838a2c592775a8e63caeb7c57)
(cherry picked from commit 41720669259995fb7f064fc0f988c9d228750b37)
(cherry picked from commit 07d273c955ea2c34a42f6de0f1e3f1bfb00c6ce1)
(cherry picked from commit 8034c856b8fcafbef405eedddc12bb0625e52a42)
(cherry picked from commit f49083d304bda30647196b550a109f528c8266dc)
(cherry picked from commit 8a597f0ab5e7e83bfa426d071185c3d3ce5fa535)
2024-11-19 17:43:34 +01:00
Paolo Valente
5238084cd8 block, bfq: save also injection state on queue merging
To prevent injection information from being lost on bfq_queue merging,
also the amount of service that a bfq_queue receives must be saved and
restored when the bfq_queue is merged and split, respectively.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 5a5436b98d5cd2714feaaa579cec49dd7f7057bb)
(cherry picked from commit 9372e98dc77c7f2ebbb808a60abb01f30d70d0bc)
(cherry picked from commit e6a5b66cfe56495f26182cfd2340e3336bb4b2b4)
(cherry picked from commit c579a3634d163ed05cc4ac258411f03db969926e)
(cherry picked from commit 359f87d07390f687634185b0dd9d6f106fb5afdd)
(cherry picked from commit d1d1f1336ed77b83e98d26175e196b45a28958f4)
(cherry picked from commit 0ff8068594640924e0cffe27d8b0273bb80d74ca)
2024-11-19 17:43:15 +01:00
Paolo Valente
0769622634 block, bfq: save also weight-raised service on queue merging
To prevent weight-raising information from being lost on bfq_queue merging,
also the amount of service that a bfq_queue receives must be saved and
restored when the bfq_queue is merged and split, respectively.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit e673914d52f913584cc4c454dfcff2e8eb04533f)
(cherry picked from commit 48f3cf9bb6ae73de3e8e6cad2e50c6e70a6cd33f)
(cherry picked from commit d947cf3f8bcbcbe2dd8f5eec82e83a35198f874b)
(cherry picked from commit 39b91f1f22265c70cdc48916ac694dad6c21c191)
(cherry picked from commit 421c82648e46467d29dc0b5cd5522f00a026083d)
(cherry picked from commit e9eecde7c67303c1dc87864c10c372019d609b0b)
(cherry picked from commit 41d4c63679c36dc63b4cc9be301ec8d8d518d33f)
2024-11-19 17:43:10 +01:00
Paolo Valente
5267faf794 block, bfq: fix switch back from soft-rt weitgh-raising
A bfq_queue may happen to be deemed as soft real-time while it is
still enjoying interactive weight-raising. If this happens because of
a false positive, then the bfq_queue is likely to loose its soft
real-time status soon. Upon losing such a status, the bfq_queue must
get back its interactive weight-raising, if its interactive period is
not over yet. But this case is not handled. This commit corrects this
error.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit d1f600fa4732dac36c71a03b790f0c829a076475)
(cherry picked from commit db891a7d6aed6cc37d681d2bbf6c9bd697059281)
(cherry picked from commit 647b877a9a8493df84a1d4abd94be089c8fed49b)
(cherry picked from commit 7eda6de0bbbfa1d05b8888b697d9b7aeffe4d64e)
(cherry picked from commit c1e076d9f4688c77dfa0f859060ae1f27a8d889e)
(cherry picked from commit db0058abb7534aeb0abebe01c65659aa3886de78)
(cherry picked from commit 40bc06529a2053ca0caf2053dd6f2a27bf7af916)
2024-11-19 17:42:58 +01:00
Paolo Valente
8b47ef547b block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
Upon an I/O-dispatch attempt, BFQ may detect that it was better to
plug I/O dispatch, and to wait for a new request to arrive for the
currently in-service queue. But the arrival of a new request for an
empty bfq_queue, and thus the switch from idle to busy of the
bfq_queue, may cause the scenario to change, and make plugging no
longer needed for service guarantees, or more convenient for
throughput. In this case, keeping I/O-dispatch plugged would certainly
lower throughput.

To address this issue, this commit makes such a check, and stops
plugging I/O if it is better to stop plugging I/O.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 7f1995c27b19060dbdff23442f375e3097c90707)
(cherry picked from commit 12ec5a8ca2486d06f880d41751383c0d9549ba49)
(cherry picked from commit 64c6efc5ccb01edf553487aff312c0b7110cb30f)
(cherry picked from commit 3e04c1949f447a8166fa6d6343bd5332d8c12a4b)
(cherry picked from commit 40a263c36cf2094311e8189b6e9173360a808b12)
(cherry picked from commit 61a02ce46503671c747e550a13972ca8abaf5030)
(cherry picked from commit 3707ff2d32dccd807b8e5e6885f07f3874c71180)
2024-11-19 17:42:55 +01:00
Pavel Begunkov
f029d24207 splice: don't generate zero-len segement bvecs
iter_file_splice_write() may spawn bvec segments with zero-length. In
preparation for prohibiting them, filter out by hand at splice level.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 0f1d344feb534555a0dcd0beafb7211a37c5355e)
(cherry picked from commit 4c72fdc13bd20d10f59b8145627312814583a945)
(cherry picked from commit cba6a18da1cc8144a07ba6a4b03e8e8dc8d24428)
(cherry picked from commit 54a17499483118cd3c92feb747c88207ce30e9ce)
(cherry picked from commit 4dec661d05c16a8e62dd833262ff68ce3e466770)
(cherry picked from commit fe99d86b681099f662b2b01155b02b8476ff428d)
(cherry picked from commit aa033460cd26157fe81e829e4744b3396a09860b)
2024-11-19 17:42:24 +01:00
Pavel Begunkov
3c61c6aa45 bvec/iter: disallow zero-length segment bvecs
zero-length bvec segments are allowed in general, but not handled by bio
and down the block layer so filtered out. This inconsistency may be
confusing and prevent from optimisations. As zero-length segments are
useless and places that were generating them are patched, declare them
not allowed.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 9b2e0016d04c6542ace0128eb82ecb3b10c97e43)
(cherry picked from commit 87afbd40acbb99860f846ad6f199e62e93be96c2)
(cherry picked from commit f0677085687d50b5ecd6e7a2e19e4aff23251cb6)
(cherry picked from commit affb154c088db678d4a541f8a4080fa5088cb10b)
(cherry picked from commit 9b383b80e8432af1d0421acf9287076db26996d7)
(cherry picked from commit f643066fcac50220888ecfe9b86c5d895d621648)
(cherry picked from commit d2f588cf9664d76f78287142f505e4f375503ae6)
2024-11-19 17:42:21 +01:00
Christoph Hellwig
8ae63d0654 target/file: allocate the bvec array as part of struct target_core_file_cmd
This saves one memory allocation, and ensures the bvecs aren't freed
before the AIO completion.  This will allow the lower level code to be
optimized so that it can avoid allocating another bvec array.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit ecd7fba0ade1d6d8d49d320df9caf96922a376b2)
(cherry picked from commit 272d2ea22b0e3da786a506896e36d3a586e6c252)
(cherry picked from commit 83ff0aa1cc08c329feb0748c575810b3ce8c0077)
(cherry picked from commit d0dc27fcc3f57d556ce4468a060e54f25c7b91b0)
(cherry picked from commit 847a30a99fc4b11c9e6cf2ec049ca20a6da9c769)
(cherry picked from commit 3799ad215edeb9276c4d16150a33de916cfa4ea1)
(cherry picked from commit ee8f417b3276049e4f0bbadf4c4524f071de2361)
2024-11-19 17:42:15 +01:00
Pavel Begunkov
f6172ea41b iov_iter: optimise bvec iov_iter_advance()
iov_iter_advance() is heavily used, but implemented through generic
means. For bvecs there is a specifically crafted function for that, so
use bvec_iter_advance() instead, it's faster and slimmer.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 54c8195b4ebe10af66b49ab9c809bc16939555fc)
(cherry picked from commit 8cac76228025fb022b1bb15e100efae8acde0425)
(cherry picked from commit c8b0dff6b5ac38ff23605bdae1c5bf62766d0fa3)
(cherry picked from commit 5bbff4ddbd3f87ddb409753269fa933109a99a7f)
(cherry picked from commit 689d9157a0b58f95cb2641a17226b023a1fb226a)
(cherry picked from commit 0df724cafe05ae311556249c7df0c2cd00e05007)
(cherry picked from commit ba5d942df07c03782ab2aa2b2dd1f7b96b3b5c52)
2024-11-19 17:42:10 +01:00
Jan Kara
b93af2c415 bfq: Use 'ttime' local variable
Use local variable 'ttime' instead of dereferencing bfqq.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 28c6def009192b673f92ea357dfb535ba15e00a4)
(cherry picked from commit bb2a213aa0a2b717c3a6e7848c6f82656d80897f)
(cherry picked from commit 2e0cfffb9a6da88cb1a786fb95618bfa714fea32)
(cherry picked from commit caff780963fdfda0ab456c24027298482d745b2f)
(cherry picked from commit b893b660ea8e998b760d48faeed2834e483158ad)
(cherry picked from commit 7e3d952af5fdcf6b02d01d55dbf658fbc2d67f41)
(cherry picked from commit 033b49f66e3808fead9e65e7c9417f26d423374f)
2024-11-19 17:42:05 +01:00
Joseph Qi
fdcb87e105 block/bfq: update comments and default value in docs for fifo_expire
Correct the comments since bfq_fifo_expire[0] is for async request,
while bfq_fifo_expire[1] is for sync request.
Also update docs, according the source code, the default
fifo_expire_async is 250ms, and fifo_expire_sync is 125ms.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 4168a8d27ed3a00f160e7f885c956f060d2a0741)
(cherry picked from commit a31ff2eb7d7cfa8331e513bb282f304117f18a77)
(cherry picked from commit a78637befaa4106f9858b3ad8e3273960d3de82b)
(cherry picked from commit bd8e7d3845c7a3b602aee361c7e3d0b5764ce060)
(cherry picked from commit a8543954accfadbb9a1cf1f64c6b3749ee3a629b)
(cherry picked from commit 960981f44b77dcd0d4e786aaef72d39057ccfc03)
(cherry picked from commit 50cfb4b6c1c2e4a3778f66510fee7a2e86e053f2)
2024-11-19 17:41:49 +01:00
Paolo Valente
8eb5a42575 block, bfq: always inject I/O of queues blocked by wakers
Suppose that I/O dispatch is plugged, to wait for new I/O for the
in-service bfq-queue, say bfqq.  Suppose then that there is a further
bfq_queue woken by bfqq, and that this woken queue has pending I/O. A
woken queue does not steal bandwidth from bfqq, because it remains
soon without I/O if bfqq is not served. So there is virtually no risk
of loss of bandwidth for bfqq if this woken queue has I/O dispatched
while bfqq is waiting for new I/O. In contrast, this extra I/O
injection boosts throughput. This commit performs this extra
injection.

Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 2ec5a5c48373d4bc2f0699f86507a65bf0b9df35)
(cherry picked from commit 0750db9767232fc2e4850868e526f4b02ecfb247)
(cherry picked from commit 8676f43249bbb0478a8b18bd87703da59902dbfd)
(cherry picked from commit df655d250f253a2f8a6792569108f30a04b7b894)
(cherry picked from commit d76168c1c3805a2c948e7ff60c8eb341e2ff0013)
(cherry picked from commit f213ae4e575f8ed67ae065fe80d06dc957f0b068)
(cherry picked from commit eb1ff3ab6d66081fbaf007c6cfc1a5e841719c0c)
2024-11-19 17:41:42 +01:00
Jan Kara
9abe5bf065 bfq: Provide helper to generate bfqq name
Instead of having helper formating bfqq pid, provide a helper to
generate full bfqq name as used in the traces. It saves some code
duplication and will save more in the coming tracepoints.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-6-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 582f04e19ad7b41df993c669805e48a01bcd9c5b)
(cherry picked from commit e030e88a4c2e220366f3db1af33d72d9638f93b5)
(cherry picked from commit e925a5fdce15f914ec2386b03bf64242792acce0)
(cherry picked from commit 9265a0e6952305932aa2b5caf2183387859dcfce)
(cherry picked from commit 41794de36673c11faca8c57625dfa50b76edde20)
(cherry picked from commit 5e830976b50a9f0a2c927b02f921f0d6ae796183)
(cherry picked from commit b5344876556e4a62cac7905bf11ca7ccf8d16d6d)
2024-11-19 17:41:18 +01:00
Yahu Gao
1297c45dcc block/bfq_wf2q: correct weight to ioprio
The return value is ioprio * BFQ_WEIGHT_CONVERSION_COEFF or 0.
What we want is ioprio or 0.
Correct this by changing the calculation.

Signed-off-by: Yahu Gao <gaoyahu19@gmail.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220107065859.25689-1-gaoyahu19@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit bcd2be763252f3a4d5fc4d6008d4d96c601ee74b)
(cherry picked from commit 81806db867a17e49d37b1d556dd39f4da5227f56)
(cherry picked from commit aed9dbfda208b30130c64bac55570e2f89084d2b)
(cherry picked from commit 7158b54afec4b986d52cc646a5dffc30eac6dc19)
(cherry picked from commit fb4f80f773e0fc89f372c7afda9c8e9794849f67)
(cherry picked from commit 5ad409c78ed2bfca202490fa13f0a93c49f21382)
2024-11-19 17:40:48 +01:00
Jan Kara
c72c9473f5 blk: Fix lock inversion between ioc lock and bfqd lock
Lockdep complains about lock inversion between ioc->lock and bfqd->lock:

bfqd -> ioc:
 put_io_context+0x33/0x90 -> ioc->lock grabbed
 blk_mq_free_request+0x51/0x140
 blk_put_request+0xe/0x10
 blk_attempt_req_merge+0x1d/0x30
 elv_attempt_insert_merge+0x56/0xa0
 blk_mq_sched_try_insert_merge+0x4b/0x60
 bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
 blk_mq_sched_insert_requests+0xd6/0x2b0
 blk_mq_flush_plug_list+0x154/0x280
 blk_finish_plug+0x40/0x60
 ext4_writepages+0x696/0x1320
 do_writepages+0x1c/0x80
 __filemap_fdatawrite_range+0xd7/0x120
 sync_file_range+0xac/0xf0

ioc->bfqd:
 bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
 put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
 exit_io_context+0x48/0x50
 do_exit+0x7e9/0xdd0
 do_group_exit+0x54/0xc0

To avoid this inversion we change blk_mq_sched_try_insert_merge() to not
free the merged request but rather leave that upto the caller similarly
to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure
to free all the merged requests after dropping bfqd->lock.

Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit fd2ef39cc9a6b9c4c41864ac506906c52f94b06a)
(cherry picked from commit 786e392c4a7bd2559bdc1a1c6ac28d8b612a0735)
(cherry picked from commit aa8e3e1451bde73dff60f1e5110b6a3cb810e35b)
(cherry picked from commit 4deef6abb13a82b148c583d9ab37374c876fe4c2)
(cherry picked from commit 1988f864ec1c494bb54e5b9df1611195f6d923f2)
(cherry picked from commit 9dc0074b0dd8960f9e06dc1494855493ff53eb68)
(cherry picked from commit c937983724111bb4526e34da0d5c6c8aea1902af)
2024-11-19 17:40:26 +01:00
Johannes Weiner
90c0c9aa4a cgroup: rstat: punt root-level optimization to individual controllers
Current users of the rstat code can source root-level statistics from
the native counters of their respective subsystem, allowing them to
forego aggregation at the root level.  This optimization is currently
implemented inside the generic rstat code, which doesn't track the root
cgroup and doesn't invoke the subsystem flush callbacks on it.

However, the memory controller cannot do this optimization, because
cgroup1 breaks out memory specifically for the local level, including at
the root level.  In preparation for the memory controller switching to
rstat, move the optimization from rstat core to the controllers.

Afterwards, rstat will always track the root cgroup for changes and
invoke the subsystem callbacks on it; and it's up to the subsystem to
special-case and skip aggregation of the root cgroup if it can source
this information through other, cheaper means.

This is the case for the io controller and the cgroup base stats.  In
their respective flush callbacks, check whether the parent is the root
cgroup, and if so, skip the unnecessary upward propagation.

The extra cost of tracking the root cgroup is negligible: on stat
changes, we actually remove a branch that checks for the root.  The
queueing for a flush touches only per-cpu data, and only the first stat
change since a flush requires a (per-cpu) lock.

Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit dc26532aed0ab25c0801a34640d1f3b9b9098a48)
(cherry picked from commit 69da183fcd0112af130879a1c93113a941e2241b)
(cherry picked from commit ddf1013871482b246147e71a04c865c1be5cf74d)
(cherry picked from commit 30fcd52e18dd1d508b1b22f7c660ac22de734f67)
(cherry picked from commit 19c9a1b9d9ae9a4f359deaf89101f9013254f43d)
(cherry picked from commit 0b4286aea9bb0a6ea6acb723f8396e476044190b)
2024-11-19 17:40:21 +01:00
Nahuel Gómez
a4d33f6631 block: ssg-iosched: adapt to new patches
../block/ssg-iosched.c:684:41: error: too few arguments to function call, expected 3, have 2
  684 |         if (blk_mq_sched_try_insert_merge(q, rq))
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~      ^
../block/blk-mq-sched.h:15:6: note: 'blk_mq_sched_try_insert_merge' declared here
   15 | bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
      |      ^                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   16 |                                    struct list_head *free);
      |                                    ~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-11-19 17:40:09 +01:00
Nahuel Gómez
82eba12440 exynos-pm: fix build without CONFIG_SEC_PM_DEBUG
We remove the checks to allow the function to be used anyway.

../drivers/soc/samsung/exynos-pm/exynos-pm.c:107:10: error: declaration of 'struct wakeup_stat_name' will not be visible outside of this function [-Werror,-Wvisibility]
  107 |                 struct wakeup_stat_name *ws_names)
      |                        ^
../drivers/soc/samsung/exynos-pm/exynos-pm.c:114:18: error: incomplete definition of type 'struct wakeup_stat_name'
  114 |                 name = ws_names->name[bit];
      |                        ~~~~~~~~^
../drivers/soc/samsung/exynos-pm/exynos-pm.c:107:10: note: forward declaration of 'struct wakeup_stat_name'
  107 |                 struct wakeup_stat_name *ws_names)
      |                        ^
../drivers/soc/samsung/exynos-pm/exynos-pm.c:131:25: error: no member named 'ws_names' in 'struct exynos_pm_info'
  131 |         if (unlikely(!pm_info->ws_names))
      |                       ~~~~~~~  ^
../include/linux/compiler.h:78:42: note: expanded from macro 'unlikely'
   78 | # define unlikely(x)    __builtin_expect(!!(x), 0)
      |                                             ^
../drivers/soc/samsung/exynos-pm/exynos-pm.c:143:51: error: no member named 'ws_names' in 'struct exynos_pm_info'
  143 |                 exynos_show_wakeup_reason_sysint(wss, &pm_info->ws_names[i]);
      |                                                        ~~~~~~~  ^
../drivers/soc/samsung/exynos-pm/exynos-pm.c:465:11: error: no member named 'ws_names' in 'struct exynos_pm_info'
  465 |         pm_info->ws_names = kzalloc(sizeof(*pm_info->ws_names) * n, GFP_KERNEL);
      |         ~~~~~~~  ^
../drivers/soc/samsung/exynos-pm/exynos-pm.c:465:47: error: no member named 'ws_names' in 'struct exynos_pm_info'
  465 |         pm_info->ws_names = kzalloc(sizeof(*pm_info->ws_names) * n, GFP_KERNEL);
      |                                             ~~~~~~~  ^
../drivers/soc/samsung/exynos-pm/exynos-pm.c:466:16: error: no member named 'ws_names' in 'struct exynos_pm_info'
  466 |         if (!pm_info->ws_names)
      |              ~~~~~~~  ^
../drivers/soc/samsung/exynos-pm/exynos-pm.c:478:14: error: no member named 'ws_names' in 'struct exynos_pm_info'
  478 |                                 pm_info->ws_names[idx].name, size);
      |                                 ~~~~~~~  ^
8 errors generated.

Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-11-19 17:39:21 +01:00
Nahuel Gómez
7059d8baa3 kernel: sysctl: add init protection to common mm-related nodes
The protected nodes are:
* dirty_ratio
* dirty_background_ratio
* dirty_bytes
* dirty_background_bytes
* dirty_expire_centisecs
* dirty_writeback_centisecs
* swappiness

This approach is inspired by [1] and makes use of the node tampering blacklist.

[1]: 239efdc263

Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-11-19 17:39:17 +01:00
Nahuel Gómez
bfb3710a7c mm: new writeback and swappiness values from Ktweak
Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>
2024-11-19 17:39:12 +01:00
Adam W. Willis
bcec04dde1 mm: apply init protection
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Change-Id: I1a1928fec9efeb29203a94644388c3ca48e7d96e
[TogoFire]: adapt to k5.4.
Signed-off-by: TogoFire <togofire@mailfence.com>
2024-11-19 17:39:06 +01:00
Uladzislau Rezki
d0dc26b405 workqueue: Make queue_rcu_work() use call_rcu_flush()
Earlier commits in this series allow battery-powered systems to build
their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option.
This Kconfig option causes call_rcu() to delay its callbacks in order
to batch them.  This means that a given RCU grace period covers more
callbacks, thus reducing the number of grace periods, in turn reducing
the amount of energy consumed, which increases battery lifetime which
can be a very good thing.  This is not a subtle effect: In some important
use cases, the battery lifetime is increased by more than 10%.

This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.

Delaying callbacks is normally not a problem because most callbacks do
nothing but free memory.  If the system is short on memory, a shrinker
will kick all currently queued lazy callbacks out of their laziness,
thus freeing their memory in short order.  Similarly, the rcu_barrier()
function, which blocks until all currently queued callbacks are invoked,
will also kick lazy callbacks, thus enabling rcu_barrier() to complete
in a timely manner.

However, there are some cases where laziness is not a good option.
For example, synchronize_rcu() invokes call_rcu(), and blocks until
the newly queued callback is invoked.  It would not be a good for
synchronize_rcu() to block for ten seconds, even on an idle system.
Therefore, synchronize_rcu() invokes call_rcu_flush() instead of
call_rcu().  The arrival of a non-lazy call_rcu_flush() callback on a
given CPU kicks any lazy callbacks that might be already queued on that
CPU.  After all, if there is going to be a grace period, all callbacks
might as well get full benefit from it.

Yes, this could be done the other way around by creating a
call_rcu_lazy(), but earlier experience with this approach and
feedback at the 2022 Linux Plumbers Conference shifted the approach
to call_rcu() being lazy with call_rcu_flush() for the few places
where laziness is inappropriate.

And another call_rcu() instance that cannot be lazy is the one
in queue_rcu_work(), given that callers to queue_rcu_work() are
not necessarily OK with long delays.

Therefore, make queue_rcu_work() use call_rcu_flush() in order to revert
to the old behavior.

Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-11-19 17:37:56 +01:00
Sultan Alsawaf
fa6b06bf46 sched/fair: Always update CPU capacity when load balancing
Limiting CPU capacity updates, which are quite cheap, results in worse
balancing decisions during opportunistic balancing (e.g., SD_BALANCE_WAKE).
This causes opportunistic placement decisions to be skewed using stale CPU
capacity data, and when a CPU isn't idling much, its capacity suffers from
even more staleness since the only exception to the 100 ms capacity update
ratelimit is a CPU exiting idle.

Since the capacity updates are cheap, always do it when load balancing in
order to improve opportunistic task placement decisions.

Change-Id: If1d451ce742fd093010057e31e71012d47fad70a
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-11-19 17:34:49 +01:00
Joel Fernandes (Google)
323a4009a4 rcu: Avoid unnecessary softirq when system is idle
When there are no callbacks pending on an idle system, I noticed that
RCU softirq is continuously firing. During this the cpu_no_qs is set to
false, and core_needs_qs is set to true indefinitely. This causes
rcu_process_callbacks to be repeatedly called, even though the node
corresponding to the CPU has that CPU's mask bit cleared and the system
is idle. I believe the race is when such mask clearing is done during
idle CPU scan of the quiescent state forcing stage in the kthread
instead of the softirq. Since the rnp mask is cleared, but the flags on
the CPU's rdp are not cleared, the CPU thinks it still needs to report
to core RCU.

Cure this by clearing the core_needs_qs flag when the CPU detects that
its node is already updated which will avoid the unwanted softirq raises
to the benefit of real-time systems.

Test: Ran rcutorture for various tree RCU configs.

Change-Id: Iee374d1dcdc74ecc5e6816a99be51feddd876931
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: mydongistiny <jaysonedson@gmail.com>
2024-11-19 17:34:20 +01:00
Tyler Nijmeh
35df832f54 mm: oom_kill: Do not dump tasks by default
This takes RCU and tasks locks to simply print debugging information,
which with certain log levels will not even display. Disable this by
default.

Change-Id: I952dba176f955239061acc7b178d88fceff8ecdf

Signed-off-by: RyuujiX <saputradenny712@gmail.com>
Signed-off-by: onettboots <blackcocopet@gmail.com>
2024-11-19 17:33:46 +01:00
Panchajanya1999
0ab2f838a5 tcp: Force the TCP no-delay option for everything
Forcing TCP no-delay will disable Nagle's algorithm, which basically collects small outgoing packets to send all at once. Disabling this will lead to all the packets being sent at their respective times, leading to better latency.

Read https://brooker.co.za/blog/2024/05/09/nagle.html for details.

Signed-off-by: prathamdubey2005 <134331217+prathamdubey2005@users.noreply.github.com>
2024-11-19 17:33:40 +01:00
gustavoss
9a9f44e174 Optimized Console FrameBuffer for upto 70% increase in Performance
Signed-off-by: Joe Maples <joe@frap129.org>
Signed-off-by: John Vincent <git@tenseventyseven.cf>
2024-11-19 17:30:21 +01:00
Ksawlii
9b077df9ac Revert "net: mac802154: Fix racy device stats updates by DEV_STATS_INC() and DEV_STATS_ADD()"
This reverts commit 97f5298d5c.
2024-11-19 14:52:14 +01:00
Greg Kroah-Hartman
b2c91c005b Linux 5.10.223
Link: https://lore.kernel.org/r/20240725142733.262322603@linuxfoundation.org
Tested-by: ChromeOS CQ Test <chromeos-kernel-stable-merge@google.com>
Tested-by: Pavel Machek (CIP) <pavel@denx.de>
Tested-by: kernelci.org bot <bot@kernelci.org>
Tested-by: Mark Brown <broonie@kernel.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:53 +01:00
Si-Wei Liu
fadb305e68 tap: add missing verification for short frame
commit ed7f2afdd0e043a397677e597ced0830b83ba0b3 upstream.

The cited commit missed to check against the validity of the frame length
in the tap_get_user_xdp() path, which could cause a corrupted skb to be
sent downstack. Even before the skb is transmitted, the
tap_get_user_xdp()-->skb_set_network_header() may assume the size is more
than ETH_HLEN. Once transmitted, this could either cause out-of-bound
access beyond the actual length, or confuse the underlayer with incorrect
or inconsistent header length in the skb metadata.

In the alternative path, tap_get_user() already prohibits short frame which
has the length less than Ethernet header size from being transmitted.

This is to drop any frame shorter than the Ethernet header size just like
how tap_get_user() does.

CVE: CVE-2024-41090
Link: https://lore.kernel.org/netdev/1717026141-25716-1-git-send-email-si-wei.liu@oracle.com/
Fixes: 0efac27791ee ("tap: accept an array of XDP buffs through sendmsg()")
Cc: stable@vger.kernel.org
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20240724170452.16837-2-dongli.zhang@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:53 +01:00
Dongli Zhang
4ab91c0fbc tun: add missing verification for short frame
commit 049584807f1d797fc3078b68035450a9769eb5c3 upstream.

The cited commit missed to check against the validity of the frame length
in the tun_xdp_one() path, which could cause a corrupted skb to be sent
downstack. Even before the skb is transmitted, the
tun_xdp_one-->eth_type_trans() may access the Ethernet header although it
can be less than ETH_HLEN. Once transmitted, this could either cause
out-of-bound access beyond the actual length, or confuse the underlayer
with incorrect or inconsistent header length in the skb metadata.

In the alternative path, tun_get_user() already prohibits short frame which
has the length less than Ethernet header size from being transmitted for
IFF_TAP.

This is to drop any frame shorter than the Ethernet header size just like
how tun_get_user() does.

CVE: CVE-2024-41091
Inspired-by: https://lore.kernel.org/netdev/1717026141-25716-1-git-send-email-si-wei.liu@oracle.com/
Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
Cc: stable@vger.kernel.org
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20240724170452.16837-3-dongli.zhang@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:53 +01:00
Jann Horn
dadacb32b4 filelock: Fix fcntl/close race recovery compat path
commit f8138f2ad2f745b9a1c696a05b749eabe44337ea upstream.

When I wrote commit 3cad1bc01041 ("filelock: Remove locks reliably when
fcntl/close race is detected"), I missed that there are two copies of the
code I was patching: The normal version, and the version for 64-bit offsets
on 32-bit kernels.
Thanks to Greg KH for stumbling over this while doing the stable
backport...

Apply exactly the same fix to the compat path for 32-bit kernels.

Fixes: c293621bbf67 ("[PATCH] stale POSIX lock handling")
Cc: stable@kernel.org
Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20240723-fs-lock-recover-compatfix-v1-1-148096719529@google.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:53 +01:00
Shengjiu Wang
9493326651 ALSA: pcm_dmaengine: Don't synchronize DMA channel when DMA is paused
commit 88e98af9f4b5b0d60c1fe7f7f2701b5467691e75 upstream.

When suspended, the DMA channel may enter PAUSE state if dmaengine_pause()
is supported by DMA.
At this state, dmaengine_synchronize() should not be called, otherwise
the DMA channel can't be resumed successfully.

Fixes: e8343410ddf0 ("ALSA: dmaengine: Synchronize dma channel after drop()")
Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
Cc: <stable@vger.kernel.org>
Link: https://patch.msgid.link/1721198693-27636-1-git-send-email-shengjiu.wang@nxp.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:52 +01:00
Seunghun Han
b6bbfcc48c ALSA: hda/realtek: Fix the speaker output on Samsung Galaxy Book Pro 360
commit d7063c08738573fc2f3296da6d31a22fa8aa843a upstream.

Samsung Galaxy Book Pro 360 (13" 2022 NT935QDB-KC71S) with codec SSID
144d:c1a4 requires the same workaround to enable the speaker amp
as other Samsung models with the ALC298 codec.

Signed-off-by: Seunghun Han <kkamagui@gmail.com>
Cc: <stable@vger.kernel.org>
Link: https://patch.msgid.link/20240718080908.8677-1-kkamagui@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:51 +01:00
Edson Juliano Drosdeck
45ee25ef66 ALSA: hda/realtek: Enable headset mic on Positivo SU C1400
commit 8fc1e8b230771442133d5cf5fa4313277aa2bb8b upstream.

Positivo SU C1400 is equipped with ALC256, and it needs
ALC269_FIXUP_ASPIRE_HEADSET_MIC quirk to make its headset mic work.

Signed-off-by: Edson Juliano Drosdeck <edson.drosdeck@gmail.com>
Cc: <stable@vger.kernel.org>
Link: https://patch.msgid.link/20240712180642.22564-1-edson.drosdeck@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:51 +01:00
lei lu
fac2aa0b41 jfs: don't walk off the end of ealist
commit d0fa70aca54c8643248e89061da23752506ec0d4 upstream.

Add a check before visiting the members of ea to
make sure each ea stays within the ealist.

Signed-off-by: lei lu <llfamsec@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:51 +01:00
lei lu
1494ceb608 ocfs2: add bounds checking to ocfs2_check_dir_entry()
commit 255547c6bb8940a97eea94ef9d464ea5967763fb upstream.

This adds sanity checks for ocfs2_dir_entry to make sure all members of
ocfs2_dir_entry don't stray beyond valid memory region.

Link: https://lkml.kernel.org/r/20240626104433.163270-1-llfamsec@gmail.com
Signed-off-by: lei lu <llfamsec@gmail.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:51 +01:00
Paolo Abeni
c246f2bb3d net: relax socket state check at accept time.
commit 26afda78cda3da974fd4c287962c169e9462c495 upstream.

Christoph reported the following splat:

WARNING: CPU: 1 PID: 772 at net/ipv4/af_inet.c:761 __inet_accept+0x1f4/0x4a0
Modules linked in:
CPU: 1 PID: 772 Comm: syz-executor510 Not tainted 6.9.0-rc7-g7da7119fe22b #56
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
RIP: 0010:__inet_accept+0x1f4/0x4a0 net/ipv4/af_inet.c:759
Code: 04 38 84 c0 0f 85 87 00 00 00 41 c7 04 24 03 00 00 00 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 ec b7 da fd <0f> 0b e9 7f fe ff ff e8 e0 b7 da fd 0f 0b e9 fe fe ff ff 89 d9 80
RSP: 0018:ffffc90000c2fc58 EFLAGS: 00010293
RAX: ffffffff836bdd14 RBX: 0000000000000000 RCX: ffff888104668000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: dffffc0000000000 R08: ffffffff836bdb89 R09: fffff52000185f64
R10: dffffc0000000000 R11: fffff52000185f64 R12: dffffc0000000000
R13: 1ffff92000185f98 R14: ffff88810754d880 R15: ffff8881007b7800
FS:  000000001c772880(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb9fcf2e178 CR3: 00000001045d2002 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 inet_accept+0x138/0x1d0 net/ipv4/af_inet.c:786
 do_accept+0x435/0x620 net/socket.c:1929
 __sys_accept4_file net/socket.c:1969 [inline]
 __sys_accept4+0x9b/0x110 net/socket.c:1999
 __do_sys_accept net/socket.c:2016 [inline]
 __se_sys_accept net/socket.c:2013 [inline]
 __x64_sys_accept+0x7d/0x90 net/socket.c:2013
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x58/0x100 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x4315f9
Code: fd ff 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 ab b4 fd ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffdb26d9c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002b
RAX: ffffffffffffffda RBX: 0000000000400300 RCX: 00000000004315f9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00000000006e1018 R08: 0000000000400300 R09: 0000000000400300
R10: 0000000000400300 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000040cdf0 R14: 000000000040ce80 R15: 0000000000000055
 </TASK>

The reproducer invokes shutdown() before entering the listener status.
After commit 94062790aedb ("tcp: defer shutdown(SEND_SHUTDOWN) for
TCP_SYN_RECV sockets"), the above causes the child to reach the accept
syscall in FIN_WAIT1 status.

Eric noted we can relax the existing assertion in __inet_accept()

Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/490
Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: 94062790aedb ("tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/23ab880a44d8cfd967e84de8b93dbf48848e3d8c.1716299669.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:51 +01:00
Dan Carpenter
aeaf2ac6b3 drm/amdgpu: Fix signedness bug in sdma_v4_0_process_trap_irq()
commit 6769a23697f17f9bf9365ca8ed62fe37e361a05a upstream.

The "instance" variable needs to be signed for the error handling to work.

Fixes: 8b2faf1a4f3b ("drm/amdgpu: add error handle to avoid out-of-bounds")
Reviewed-by: Bob Zhou <bob.zhou@amd.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Siddh Raman Pant <siddh.raman.pant@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:51 +01:00
Gabriel Krisman Bertazi
54d36cabde ext4: fix error code saved on super block during file system abort
commit 124e7c61deb27d758df5ec0521c36cf08d417f7a upstream.

ext4_abort will eventually call ext4_errno_to_code, which translates the
errno to an EXT4_ERR specific error.  This means that ext4_abort expects
an errno.  By using EXT4_ERR_ here, it gets misinterpreted (as an errno),
and ends up saving EXT4_ERR_EBUSY on the superblock during an abort,
which makes no sense.

ESHUTDOWN will get properly translated to EXT4_ERR_SHUTDOWN, so use that
instead.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Link: https://lore.kernel.org/r/20211026173302.84000-1-krisman@collabora.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:51 +01:00
Bart Van Assche
7be7fa0f35 scsi: core: Fix a use-after-free
commit 8fe4ce5836e932f5766317cb651c1ff2a4cd0506 upstream.

There are two .exit_cmd_priv implementations. Both implementations use
resources associated with the SCSI host. Make sure that these resources are
still available when .exit_cmd_priv is called by waiting inside
scsi_remove_host() until the tag set has been freed.

This commit fixes the following use-after-free:

==================================================================
BUG: KASAN: use-after-free in srp_exit_cmd_priv+0x27/0xd0 [ib_srp]
Read of size 8 at addr ffff888100337000 by task multipathd/16727
Call Trace:
 <TASK>
 dump_stack_lvl+0x34/0x44
 print_report.cold+0x5e/0x5db
 kasan_report+0xab/0x120
 srp_exit_cmd_priv+0x27/0xd0 [ib_srp]
 scsi_mq_exit_request+0x4d/0x70
 blk_mq_free_rqs+0x143/0x410
 __blk_mq_free_map_and_rqs+0x6e/0x100
 blk_mq_free_tag_set+0x2b/0x160
 scsi_host_dev_release+0xf3/0x1a0
 device_release+0x54/0xe0
 kobject_put+0xa5/0x120
 device_release+0x54/0xe0
 kobject_put+0xa5/0x120
 scsi_device_dev_release_usercontext+0x4c1/0x4e0
 execute_in_process_context+0x23/0x90
 device_release+0x54/0xe0
 kobject_put+0xa5/0x120
 scsi_disk_release+0x3f/0x50
 device_release+0x54/0xe0
 kobject_put+0xa5/0x120
 disk_release+0x17f/0x1b0
 device_release+0x54/0xe0
 kobject_put+0xa5/0x120
 dm_put_table_device+0xa3/0x160 [dm_mod]
 dm_put_device+0xd0/0x140 [dm_mod]
 free_priority_group+0xd8/0x110 [dm_multipath]
 free_multipath+0x94/0xe0 [dm_multipath]
 dm_table_destroy+0xa2/0x1e0 [dm_mod]
 __dm_destroy+0x196/0x350 [dm_mod]
 dev_remove+0x10c/0x160 [dm_mod]
 ctl_ioctl+0x2c2/0x590 [dm_mod]
 dm_ctl_ioctl+0x5/0x10 [dm_mod]
 __x64_sys_ioctl+0xb4/0xf0
 dm_ctl_ioctl+0x5/0x10 [dm_mod]
 __x64_sys_ioctl+0xb4/0xf0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

Link: https://lore.kernel.org/r/20220826002635.919423-1-bvanassche@acm.org
Fixes: 65ca846a5314 ("scsi: core: Introduce {init,exit}_cmd_priv()")
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: John Garry <john.garry@huawei.com>
Cc: Li Zhijian <lizhijian@fujitsu.com>
Reported-by: Li Zhijian <lizhijian@fujitsu.com>
Tested-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[mheyne: fixed contextual conflicts:
  - drivers/scsi/hosts.c: due to missing commit 973dac8a8a14 ("scsi: core: Refine how we set tag_set NUMA node")
  - drivers/scsi/scsi_sysfs.c: due to missing commit 6f8191fdf41d ("block: simplify disk shutdown")
  - drivers/scsi/scsi_scan.c: due to missing commit 59506abe5e34 ("scsi: core: Inline scsi_mq_alloc_queue()")]
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:51 +01:00
Jason Xing
a4599f9d80 bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue
commit 6648e613226e18897231ab5e42ffc29e63fa3365 upstream.

Fix NULL pointer data-races in sk_psock_skb_ingress_enqueue() which
syzbot reported [1].

[1]
BUG: KCSAN: data-race in sk_psock_drop / sk_psock_skb_ingress_enqueue

write to 0xffff88814b3278b8 of 8 bytes by task 10724 on cpu 1:
 sk_psock_stop_verdict net/core/skmsg.c:1257 [inline]
 sk_psock_drop+0x13e/0x1f0 net/core/skmsg.c:843
 sk_psock_put include/linux/skmsg.h:459 [inline]
 sock_map_close+0x1a7/0x260 net/core/sock_map.c:1648
 unix_release+0x4b/0x80 net/unix/af_unix.c:1048
 __sock_release net/socket.c:659 [inline]
 sock_close+0x68/0x150 net/socket.c:1421
 __fput+0x2c1/0x660 fs/file_table.c:422
 __fput_sync+0x44/0x60 fs/file_table.c:507
 __do_sys_close fs/open.c:1556 [inline]
 __se_sys_close+0x101/0x1b0 fs/open.c:1541
 __x64_sys_close+0x1f/0x30 fs/open.c:1541
 do_syscall_64+0xd3/0x1d0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

read to 0xffff88814b3278b8 of 8 bytes by task 10713 on cpu 0:
 sk_psock_data_ready include/linux/skmsg.h:464 [inline]
 sk_psock_skb_ingress_enqueue+0x32d/0x390 net/core/skmsg.c:555
 sk_psock_skb_ingress_self+0x185/0x1e0 net/core/skmsg.c:606
 sk_psock_verdict_apply net/core/skmsg.c:1008 [inline]
 sk_psock_verdict_recv+0x3e4/0x4a0 net/core/skmsg.c:1202
 unix_read_skb net/unix/af_unix.c:2546 [inline]
 unix_stream_read_skb+0x9e/0xf0 net/unix/af_unix.c:2682
 sk_psock_verdict_data_ready+0x77/0x220 net/core/skmsg.c:1223
 unix_stream_sendmsg+0x527/0x860 net/unix/af_unix.c:2339
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg+0x140/0x180 net/socket.c:745
 ____sys_sendmsg+0x312/0x410 net/socket.c:2584
 ___sys_sendmsg net/socket.c:2638 [inline]
 __sys_sendmsg+0x1e9/0x280 net/socket.c:2667
 __do_sys_sendmsg net/socket.c:2676 [inline]
 __se_sys_sendmsg net/socket.c:2674 [inline]
 __x64_sys_sendmsg+0x46/0x50 net/socket.c:2674
 do_syscall_64+0xd3/0x1d0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

value changed: 0xffffffff83d7feb0 -> 0x0000000000000000

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 10713 Comm: syz-executor.4 Tainted: G        W          6.8.0-syzkaller-08951-gfe46a7dd189e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024

Prior to this, commit 4cd12c6065df ("bpf, sockmap: Fix NULL pointer
dereference in sk_psock_verdict_data_ready()") fixed one NULL pointer
similarly due to no protection of saved_data_ready. Here is another
different caller causing the same issue because of the same reason. So
we should protect it with sk_callback_lock read lock because the writer
side in the sk_psock_drop() uses "write_lock_bh(&sk->sk_callback_lock);".

To avoid errors that could happen in future, I move those two pairs of
lock into the sk_psock_data_ready(), which is suggested by John Fastabend.

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: syzbot+aa8c8ec2538929f18f2d@syzkaller.appspotmail.com
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=aa8c8ec2538929f18f2d
Link: https://lore.kernel.org/all/20240329134037.92124-1-kerneljasonxing@gmail.com
Link: https://lore.kernel.org/bpf/20240404021001.94815-1-kerneljasonxing@gmail.com
Signed-off-by: Ashwin Dayanand Kamat <ashwin.kamat@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:51 +01:00
Daniel Borkmann
33c88d138d bpf: Fix overrunning reservations in ringbuf
commit cfa1a2329a691ffd991fcf7248a57d752e712881 upstream.

The BPF ring buffer internally is implemented as a power-of-2 sized circular
buffer, with two logical and ever-increasing counters: consumer_pos is the
consumer counter to show which logical position the consumer consumed the
data, and producer_pos which is the producer counter denoting the amount of
data reserved by all producers.

Each time a record is reserved, the producer that "owns" the record will
successfully advance producer counter. In user space each time a record is
read, the consumer of the data advanced the consumer counter once it finished
processing. Both counters are stored in separate pages so that from user
space, the producer counter is read-only and the consumer counter is read-write.

One aspect that simplifies and thus speeds up the implementation of both
producers and consumers is how the data area is mapped twice contiguously
back-to-back in the virtual memory, allowing to not take any special measures
for samples that have to wrap around at the end of the circular buffer data
area, because the next page after the last data page would be first data page
again, and thus the sample will still appear completely contiguous in virtual
memory.

Each record has a struct bpf_ringbuf_hdr { u32 len; u32 pg_off; } header for
book-keeping the length and offset, and is inaccessible to the BPF program.
Helpers like bpf_ringbuf_reserve() return `(void *)hdr + BPF_RINGBUF_HDR_SZ`
for the BPF program to use. Bing-Jhong and Muhammad reported that it is however
possible to make a second allocated memory chunk overlapping with the first
chunk and as a result, the BPF program is now able to edit first chunk's
header.

For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size
of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to
bpf_ringbuf_reserve() is made. This will allocate a chunk A, which is in
[0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, lets
allocate a chunk B with size 0x3000. This will succeed because consumer_pos
was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask`
check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able
to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned
earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data
pages. This means that chunk B at [0x4000,0x4008] is chunk A's header.
bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header's pg_off to then
locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk
B modified chunk A's header, then bpf_ringbuf_commit() refers to the wrong
page and could cause a crash.

Fix it by calculating the oldest pending_pos and check whether the range
from the oldest outstanding record to the newest would span beyond the ring
buffer size. If that is the case, then reject the request. We've tested with
the ring buffer benchmark in BPF selftests (./benchs/run_bench_ringbufs.sh)
before/after the fix and while it seems a bit slower on some benchmarks, it
is still not significantly enough to matter.

Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Reported-by: Muhammad Ramdhan <ramdhan@starlabs.sg>
Co-developed-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240621140828.18238-1-daniel@iogearbox.net
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:51 +01:00
Kuan-Wei Chiu
38ee79d89a ACPI: processor_idle: Fix invalid comparison with insertion sort for latency
commit 233323f9b9f828cd7cd5145ad811c1990b692542 upstream.

The acpi_cst_latency_cmp() comparison function currently used for
sorting C-state latencies does not satisfy transitivity, causing
incorrect sorting results.

Specifically, if there are two valid acpi_processor_cx elements A and B
and one invalid element C, it may occur that A < B, A = C, and B = C.
Sorting algorithms assume that if A < B and A = C, then C < B, leading
to incorrect ordering.

Given the small size of the array (<=8), we replace the library sort
function with a simple insertion sort that properly ignores invalid
elements and sorts valid ones based on latency. This change ensures
correct ordering of the C-state latencies.

Fixes: 65ea8f2c6e23 ("ACPI: processor idle: Fix up C-state latency if not ordered")
Reported-by: Julian Sikorski <belegdol@gmail.com>
Closes: https://lore.kernel.org/lkml/70674dc7-5586-4183-8953-8095567e73df@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Tested-by: Julian Sikorski <belegdol@gmail.com>
Cc: All applicable <stable@vger.kernel.org>
Link: https://patch.msgid.link/20240701205639.117194-1-visitorckw@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:50 +01:00
Masahiro Yamada
009969cd50 ARM: 9324/1: fix get_user() broken with veneer
commit 24d3ba0a7b44c1617c27f5045eecc4f34752ab03 upstream.

The 32-bit ARM kernel stops working if the kernel grows to the point
where veneers for __get_user_* are created.

AAPCS32 [1] states, "Register r12 (IP) may be used by a linker as a
scratch register between a routine and any subroutine it calls. It
can also be used within a routine to hold intermediate values between
subroutine calls."

However, bl instructions buried within the inline asm are unpredictable
for compilers; hence, "ip" must be added to the clobber list.

This becomes critical when veneers for __get_user_* are created because
veneers use the ip register since commit 02e541db0540 ("ARM: 8323/1:
force linker to use PIC veneers").

[1]: https://github.com/ARM-software/abi-aa/blob/2023Q1/aapcs32/aapcs32.rst

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: John Stultz <jstultz@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-19 14:19:50 +01:00
David Lechner
395b1f694e spi: mux: set ctlr->bits_per_word_mask
[ Upstream commit c8bd922d924bb4ab6c6c488310157d1a27996f31 ]

Like other SPI controller flags, bits_per_word_mask may be used by a
peripheral driver, so it needs to reflect the capabilities of the
underlying controller.

Signed-off-by: David Lechner <dlechner@baylibre.com>
Link: https://patch.msgid.link/20240708-spi-mux-fix-v1-3-6c8845193128@baylibre.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-19 14:19:50 +01:00