kernel_samsung_a53x

Author	SHA1	Message	Date
Christoph Hellwig	38f90965d6	block: remove the blk_flush_integrity call in blk_integrity_unregister [ Upstream commit e8bc14d116aeac8f0f133ec8d249acf4e0658da7 ] Now that there are no indirect calls for PI processing there is no way to dereference a NULL pointer here. Additionally drivers now always freeze the queue (or in case of stacking drivers use their internal equivalent) around changing the integrity profile. This is effectively a revert of commit 3df49967f6f1 ("block: flush the integrity workqueue in blk_integrity_unregister"). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240613084839.1044015-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-19 14:52:22 +01:00
Christoph Hellwig	41a33c1066	block: initialize integrity buffer to zero before writing it to media commit 899ee2c3829c5ac14bfc7d3c4a5846c0b709b78f upstream. Metadata added by bio_integrity_prep is using plain kmalloc, which leads to random kernel memory being written media. For PI metadata this is limited to the app tag that isn't used by kernel generated metadata, but for non-PI metadata the entire buffer leaks kernel memory. Fix this by adding the __GFP_ZERO flag to allocations for writes. Fixes: 7ba1ba12eeef ("block: Block layer data integrity support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240613084839.1044015-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Shivani Agarwal <shivani.agarwal@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-01-19 14:52:22 +01:00
Arjan van de Ven	54bab59901	fs: ext4: fsync: optimize double-fsync() a bunch There are cases where EXT4 is a bit too conservative sending barriers down to the disk; there are cases where the transaction in progress is not the one that sent the barrier (in other words: the fsync is for a file for which the IO happened more time ago and all data was already sent to the disk). For that case, a more performing tradeoff can be made on SSD devices (which have the ability to flush their dram caches in a hurry on a power fail event) where the barrier gets sent to the disk, but we don't need to wait for the barrier to complete. Any consecutive IO will block on the barrier correctly. Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> (cherry picked from commit 74aa09a7751e438bd15b5cd73f611021b7239240) (cherry picked from commit fa3bdf1a32cac074ff52403cb9ce18eb18c7f7d1)	2025-01-16 22:01:03 +01:00
ztc1997	8d7a99febf	block: Do not allow boosters to adjusting scheduler	2025-01-16 21:58:52 +01:00
Nathan Chancellor	046daa8e2f	blk-iocost: Avoid using clamp() on inuse in __propagate_weights() [ Upstream commit 57e420c84f9ab55ba4c5e2ae9c5f6c8e1ea834d2 ] After a recent change to clamp() and its variants [1] that increases the coverage of the check that high is greater than low because it can be done through inlining, certain build configurations (such as s390 defconfig) fail to build with clang with: block/blk-iocost.c:1101:11: error: call to '__compiletime_assert_557' declared with 'error' attribute: clamp() low limit 1 greater than high limit active 1101 \| inuse = clamp_t(u32, inuse, 1, active); \| ^ include/linux/minmax.h:218:36: note: expanded from macro 'clamp_t' 218 \| #define clamp_t(type, val, lo, hi) __careful_clamp(type, val, lo, hi) \| ^ include/linux/minmax.h:195:2: note: expanded from macro '__careful_clamp' 195 \| __clamp_once(type, val, lo, hi, __UNIQUE_ID(v_), __UNIQUE_ID(l_), __UNIQUE_ID(h_)) \| ^ include/linux/minmax.h:188:2: note: expanded from macro '__clamp_once' 188 \| BUILD_BUG_ON_MSG(statically_true(ulo > uhi), \ \| ^ __propagate_weights() is called with an active value of zero in ioc_check_iocgs(), which results in the high value being less than the low value, which is undefined because the value returned depends on the order of the comparisons. The purpose of this expression is to ensure inuse is not more than active and at least 1. This could be written more simply with a ternary expression that uses min(inuse, active) as the condition so that the value of that condition can be used if it is not zero and one if it is. Do this conversion to resolve the error and add a comment to deter people from turning this back into clamp(). Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Link: https://lore.kernel.org/r/34d53778977747f19cce2abb287bb3e6@AcuMS.aculab.com/ [1] Suggested-by: David Laight <david.laight@aculab.com> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Closes: https://lore.kernel.org/llvm/CA+G9fYsD7mw13wredcZn0L-KBA3yeoVSTuxnss-AEWMN3ha0cA@mail.gmail.com/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202412120322.3GfVe3vF-lkp@intel.com/ Signed-off-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-02 17:01:17 +01:00
Muchun Song	e9ab785723	block: fix ordering between checking BLK_MQ_S_STOPPED request adding commit 96a9fe64bfd486ebeeacf1e6011801ffe89dae18 upstream. Supposing first scenario with a virtio_blk driver. CPU0 CPU1 blk_mq_try_issue_directly() __blk_mq_issue_directly() q->mq_ops->queue_rq() virtio_queue_rq() blk_mq_stop_hw_queue() virtblk_done() blk_mq_request_bypass_insert() 1) store blk_mq_start_stopped_hw_queue() clear_bit(BLK_MQ_S_STOPPED) 3) store blk_mq_run_hw_queue() if (!blk_mq_hctx_has_pending()) 4) load return blk_mq_sched_dispatch_requests() blk_mq_run_hw_queue() if (!blk_mq_hctx_has_pending()) return blk_mq_sched_dispatch_requests() if (blk_mq_hctx_stopped()) 2) load return __blk_mq_sched_dispatch_requests() Supposing another scenario. CPU0 CPU1 blk_mq_requeue_work() blk_mq_insert_request() 1) store virtblk_done() blk_mq_start_stopped_hw_queue() blk_mq_run_hw_queues() clear_bit(BLK_MQ_S_STOPPED) 3) store blk_mq_run_hw_queue() if (!blk_mq_hctx_has_pending()) 4) load return blk_mq_sched_dispatch_requests() if (blk_mq_hctx_stopped()) 2) load continue blk_mq_run_hw_queue() Both scenarios are similar, the full memory barrier should be inserted between 1) and 2), as well as between 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1 sees dispatch list. Otherwise, either CPU will not rerun the hardware queue causing starvation of the request. The easy way to fix it is to add the essential full memory barrier into helper of blk_mq_hctx_stopped(). In order to not affect the fast path (hardware queue is not stopped most of the time), we only insert the barrier into the slow path. Actually, only slow path needs to care about missing of dispatching the request to the low-level device driver. Fixes: 320ae51feed5 ("blk-mq: new multi-queue block IO queueing mechanism") Cc: stable@vger.kernel.org Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241014092934.53630-4-songmuchun@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-17 13:24:19 +01:00
Ksawlii	a34bd6102b	Revert "block: remove the blk_flush_integrity call in blk_integrity_unregister" This reverts commit `a5698c3795`.	2024-11-24 00:23:47 +01:00
Ksawlii	5ac9b91c6d	Revert "block: initialize integrity buffer to zero before writing it to media" This reverts commit `9e8e62c4a8`.	2024-11-24 00:23:47 +01:00
Ksawlii	293f547e21	Revert "block, bfq: fix possible UAF for bfqq->bic with merge chain" This reverts commit `7c6f64c78f`.	2024-11-24 00:23:28 +01:00
Ksawlii	8b80716260	Revert "block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()" This reverts commit `9ea4bad25e`.	2024-11-24 00:23:28 +01:00
Ksawlii	5af498c438	Revert "block, bfq: don't break merge chain in bfq_split_bfqq()" This reverts commit `25901091f1`.	2024-11-24 00:23:28 +01:00
Ksawlii	9f6606ad3a	Revert "block: print symbolic error name instead of error code" This reverts commit `5e674af9b4`.	2024-11-24 00:23:28 +01:00
Ksawlii	b18c91a11b	Revert "block: fix potential invalid pointer dereference in blk_add_partition" This reverts commit `f081315218`.	2024-11-24 00:23:28 +01:00
Ksawlii	3eaf19ddc8	Revert "blk_iocost: fix more out of bound shifts" This reverts commit `31aab2e514`.	2024-11-24 00:23:10 +01:00
Ksawlii	89c3aaef2c	Revert "blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race" This reverts commit `6df069c465`.	2024-11-24 00:22:52 +01:00
Yu Kuai	69f121433f	block, bfq: fix procress reference leakage for bfqq in merge chain [ Upstream commit 73aeab373557fa6ee4ae0b742c6211ccd9859280 ] Original state: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ \| \| \| \--------------\ \-------------\ \-------------\\| V V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 1 2 4 After commit 0e456dba86c7 ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()"), if P1 issues a new IO: Without the patch: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ \| \| \| \------------------------------\ \-------------\\| V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 0 2 4 bfqq3 will be used to handle IO from P1, this is not expected, IO should be redirected to bfqq4; With the patch: ------------------------------------------- \| \| Process 1 Process 2 Process 3 \| Process 4 (BIC1) (BIC2) (BIC3) \| (BIC4) \| \| \| \| \-------------\ \-------------\\| V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 0 2 4 IO is redirected to bfqq4, however, procress reference of bfqq3 is still 2, while there is only P2 using it. Fix the problem by calling bfq_merge_bfqqs() for each bfqq in the merge chain. Also change bfqq_merge_bfqqs() to return new_bfqq to simplify code. Fixes: 0e456dba86c7 ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:22:00 +01:00
Omar Sandoval	6df069c465	blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race commit e972b08b91ef48488bae9789f03cfedb148667fb upstream. We're seeing crashes from rq_qos_wake_function that look like this: BUG: unable to handle page fault for address: ffffafe180a40084 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 100000067 P4D 100000067 PUD 10027c067 PMD 10115d067 PTE 0 Oops: Oops: 0002 [#1] PREEMPT SMP PTI CPU: 17 UID: 0 PID: 0 Comm: swapper/17 Not tainted 6.12.0-rc3-00013-geca631b8fe80 #11 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:_raw_spin_lock_irqsave+0x1d/0x40 Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 54 9c 41 5c fa 65 ff 05 62 97 30 4c 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 0a 4c 89 e0 41 5c c3 cc cc cc cc 89 c6 e8 2c 0b 00 RSP: 0018:ffffafe180580ca0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffafe180a3f7a8 RCX: 0000000000000011 RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffafe180a40084 RBP: 0000000000000000 R08: 00000000001e7240 R09: 0000000000000011 R10: 0000000000000028 R11: 0000000000000888 R12: 0000000000000002 R13: ffffafe180a40084 R14: 0000000000000000 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff9aaf1f280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffafe180a40084 CR3: 000000010e428002 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> try_to_wake_up+0x5a/0x6a0 rq_qos_wake_function+0x71/0x80 __wake_up_common+0x75/0xa0 __wake_up+0x36/0x60 scale_up.part.0+0x50/0x110 wb_timer_fn+0x227/0x450 ... So rq_qos_wake_function() calls wake_up_process(data->task), which calls try_to_wake_up(), which faults in raw_spin_lock_irqsave(&p->pi_lock). p comes from data->task, and data comes from the waitqueue entry, which is stored on the waiter's stack in rq_qos_wait(). Analyzing the core dump with drgn, I found that the waiter had already woken up and moved on to a completely unrelated code path, clobbering what was previously data->task. Meanwhile, the waker was passing the clobbered garbage in data->task to wake_up_process(), leading to the crash. What's happening is that in between rq_qos_wake_function() deleting the waitqueue entry and calling wake_up_process(), rq_qos_wait() is finding that it already got a token and returning. The race looks like this: rq_qos_wait() rq_qos_wake_function() ============================================================== prepare_to_wait_exclusive() data->got_token = true; list_del_init(&curr->entry); if (data.got_token) break; finish_wait(&rqw->wait, &data.wq); ^- returns immediately because list_empty_careful(&wq_entry->entry) is true ... return, go do something else ... wake_up_process(data->task) (NO LONGER VALID!)-^ Normally, finish_wait() is supposed to synchronize against the waker. But, as noted above, it is returning immediately because the waitqueue entry has already been removed from the waitqueue. The bug is that rq_qos_wake_function() is accessing the waitqueue entry AFTER deleting it. Note that autoremove_wake_function() wakes the waiter and THEN deletes the waitqueue entry, which is the proper order. Fix it by swapping the order. We also need to use list_del_init_careful() to match the list_empty_careful() in finish_wait(). Fixes: 38cfb5a45ee0 ("blk-wbt: improve waking of tasks") Cc: stable@vger.kernel.org Signed-off-by: Omar Sandoval <osandov@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/d3bee2463a67b1ee597211823bf7ad3721c26e41.1729014591.git.osandov@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-23 23:21:55 +01:00
Konstantin Ovsepian	31aab2e514	blk_iocost: fix more out of bound shifts [ Upstream commit 9bce8005ec0dcb23a58300e8522fe4a31da606fa ] Recently running UBSAN caught few out of bound shifts in the ioc_forgive_debts() function: UBSAN: shift-out-of-bounds in block/blk-iocost.c:2142:38 shift exponent 80 is too large for 64-bit type 'u64' (aka 'unsigned long long') ... UBSAN: shift-out-of-bounds in block/blk-iocost.c:2144:30 shift exponent 80 is too large for 64-bit type 'u64' (aka 'unsigned long long') ... Call Trace: <IRQ> dump_stack_lvl+0xca/0x130 __ubsan_handle_shift_out_of_bounds+0x22c/0x280 ? __lock_acquire+0x6441/0x7c10 ioc_timer_fn+0x6cec/0x7750 ? blk_iocost_init+0x720/0x720 ? call_timer_fn+0x5d/0x470 call_timer_fn+0xfa/0x470 ? blk_iocost_init+0x720/0x720 __run_timer_base+0x519/0x700 ... Actual impact of this issue was not identified but I propose to fix the undefined behaviour. The proposed fix to prevent those out of bound shifts consist of precalculating exponent before using it the shift operations by taking min value from the actual exponent and maximum possible number of bits. Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Konstantin Ovsepian <ovs@ovs.to> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240822154137.2627818-1-ovs@ovs.to Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:38 +01:00
Riyan Dhiman	f081315218	block: fix potential invalid pointer dereference in blk_add_partition [ Upstream commit 26e197b7f9240a4ac301dd0ad520c0c697c2ea7d ] The blk_add_partition() function initially used a single if-condition (IS_ERR(part)) to check for errors when adding a partition. This was modified to handle the specific case of -ENXIO separately, allowing the function to proceed without logging the error in this case. However, this change unintentionally left a path where md_autodetect_dev() could be called without confirming that part is a valid pointer. This commit separates the error handling logic by splitting the initial if-condition, improving code readability and handling specific error scenarios explicitly. The function now distinguishes the general error case from -ENXIO without altering the existing behavior of md_autodetect_dev() calls. Fixes: b72053072c0b (block: allow partitions on host aware zone devices) Signed-off-by: Riyan Dhiman <riyandhiman14@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240911132954.5874-1-riyandhiman14@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:19 +01:00
Christian Heusel	5e674af9b4	block: print symbolic error name instead of error code [ Upstream commit 25c1772a0493463408489b1fae65cf77fe46cac1 ] Utilize the %pe print specifier to get the symbolic error name as a string (i.e "-ENOMEM") in the log message instead of the error code to increase its readablility. This change was suggested in https://lore.kernel.org/all/92972476-0b1f-4d0a-9951-af3fc8bc6e65@suswa.mountain/ Signed-off-by: Christian Heusel <christian@heusel.eu> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240111231521.1596838-1-christian@heusel.eu Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 26e197b7f924 ("block: fix potential invalid pointer dereference in blk_add_partition") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:19 +01:00
Yu Kuai	25901091f1	block, bfq: don't break merge chain in bfq_split_bfqq() [ Upstream commit 42c306ed723321af4003b2a41bb73728cab54f85 ] Consider the following scenario: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ \| \| \| \-------------\ \-------------\ \--------------\\| V V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 1 2 4 If Process 1 issue a new IO and bfqq2 is found, and then bfq_init_rq() decide to spilt bfqq2 by bfq_split_bfqq(). Howerver, procress reference of bfqq2 is 1 and bfq_split_bfqq() just clear the coop flag, which will break the merge chain. Expected result: caller will allocate a new bfqq for BIC1 Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) \| \| \| \-------------\ \--------------\\| V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 0 1 3 Since the condition is only used for the last bfqq4 when the previous bfqq2 and bfqq3 are already splited. Fix the problem by checking if bfqq is the last one in the merge chain as well. Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:19 +01:00
Yu Kuai	9ea4bad25e	block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator() [ Upstream commit 0e456dba86c7f9a19792204a044835f1ca2c8dbb ] Consider the following merge chain: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ \| \| \| \--------------\ \-------------\ \-------------\\| V V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 IO from Process 1 will get bfqf2 from BIC1 first, then bfq_setup_cooperator() will found bfqq2 already merged to bfqq3 and then handle this IO from bfqq3. However, the merge chain can be much deeper and bfqq3 can be merged to other bfqq as well. Fix this problem by iterating to the last bfqq in bfq_setup_cooperator(). Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:19 +01:00
Yu Kuai	7c6f64c78f	block, bfq: fix possible UAF for bfqq->bic with merge chain [ Upstream commit 18ad4df091dd5d067d2faa8fce1180b79f7041a7 ] 1) initial state, three tasks: Process 1 Process 2 Process 3 (BIC1) (BIC2) (BIC3) \| Λ \| Λ \| Λ \| \| \| \| \| \| V \| V \| V \| bfqq1 bfqq2 bfqq3 process ref: 1 1 1 2) bfqq1 merged to bfqq2: Process 1 Process 2 Process 3 (BIC1) (BIC2) (BIC3) \| \| \| Λ \--------------\\| \| \| V V \| bfqq1--------->bfqq2 bfqq3 process ref: 0 2 1 3) bfqq2 merged to bfqq3: Process 1 Process 2 Process 3 (BIC1) (BIC2) (BIC3) here -> Λ \| \| \--------------\ \-------------\\| V V bfqq1--------->bfqq2---------->bfqq3 process ref: 0 1 3 In this case, IO from Process 1 will get bfqq2 from BIC1 first, and then get bfqq3 through merge chain, and finially handle IO by bfqq3. Howerver, current code will think bfqq2 is owned by BIC1, like initial state, and set bfqq2->bic to BIC1. bfq_insert_request -> by Process 1 bfqq = bfq_init_rq(rq) bfqq = bfq_get_bfqq_handle_split bfqq = bic_to_bfqq -> get bfqq2 from BIC1 bfqq->ref++ rq->elv.priv[0] = bic rq->elv.priv[1] = bfqq if (bfqq_process_refs(bfqq) == 1) bfqq->bic = bic -> record BIC1 to bfqq2 __bfq_insert_request new_bfqq = bfq_setup_cooperator -> get bfqq3 from bfqq2->new_bfqq bfqq_request_freed(bfqq) new_bfqq->ref++ rq->elv.priv[1] = new_bfqq -> handle IO by bfqq3 Fix the problem by checking bfqq is from merge chain fist. And this might fix a following problem reported by our syzkaller(unreproducible): ================================================================== BUG: KASAN: slab-use-after-free in bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline] BUG: KASAN: slab-use-after-free in bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline] BUG: KASAN: slab-use-after-free in bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889 Write of size 1 at addr ffff888123839eb8 by task kworker/0:1H/18595 CPU: 0 PID: 18595 Comm: kworker/0:1H Tainted: G L 6.6.0-07439-gba2303cacfda #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Workqueue: kblockd blk_mq_requeue_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:364 [inline] print_report+0x10d/0x610 mm/kasan/report.c:475 kasan_report+0x8e/0xc0 mm/kasan/report.c:588 bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline] bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline] bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889 bfq_get_bfqq_handle_split+0x169/0x5d0 block/bfq-iosched.c:6757 bfq_init_rq block/bfq-iosched.c:6876 [inline] bfq_insert_request block/bfq-iosched.c:6254 [inline] bfq_insert_requests+0x1112/0x5cf0 block/bfq-iosched.c:6304 blk_mq_insert_request+0x290/0x8d0 block/blk-mq.c:2593 blk_mq_requeue_work+0x6bc/0xa70 block/blk-mq.c:1502 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 </TASK> Allocated by task 20776: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 __kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328 kasan_slab_alloc include/linux/kasan.h:188 [inline] slab_post_alloc_hook mm/slab.h:763 [inline] slab_alloc_node mm/slub.c:3458 [inline] kmem_cache_alloc_node+0x1a4/0x6f0 mm/slub.c:3503 ioc_create_icq block/blk-ioc.c:370 [inline] ioc_find_get_icq+0x180/0xaa0 block/blk-ioc.c:436 bfq_prepare_request+0x39/0xf0 block/bfq-iosched.c:6812 blk_mq_rq_ctx_init.isra.7+0x6ac/0xa00 block/blk-mq.c:403 __blk_mq_alloc_requests+0xcc0/0x1070 block/blk-mq.c:517 blk_mq_get_new_requests block/blk-mq.c:2940 [inline] blk_mq_submit_bio+0x624/0x27c0 block/blk-mq.c:3042 __submit_bio+0x331/0x6f0 block/blk-core.c:624 __submit_bio_noacct_mq block/blk-core.c:703 [inline] submit_bio_noacct_nocheck+0x816/0xb40 block/blk-core.c:732 submit_bio_noacct+0x7a6/0x1b50 block/blk-core.c:826 xlog_write_iclog+0x7d5/0xa00 fs/xfs/xfs_log.c:1958 xlog_state_release_iclog+0x3b8/0x720 fs/xfs/xfs_log.c:619 xlog_cil_push_work+0x19c5/0x2270 fs/xfs/xfs_log_cil.c:1330 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 Freed by task 946: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522 ____kasan_slab_free mm/kasan/common.c:236 [inline] __kasan_slab_free+0x12c/0x1c0 mm/kasan/common.c:244 kasan_slab_free include/linux/kasan.h:164 [inline] slab_free_hook mm/slub.c:1815 [inline] slab_free_freelist_hook mm/slub.c:1841 [inline] slab_free mm/slub.c:3786 [inline] kmem_cache_free+0x118/0x6f0 mm/slub.c:3808 rcu_do_batch+0x35c/0xe30 kernel/rcu/tree.c:2189 rcu_core+0x819/0xd90 kernel/rcu/tree.c:2462 __do_softirq+0x1b0/0x7a2 kernel/softirq.c:553 Last potentially related work creation: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492 __call_rcu_common kernel/rcu/tree.c:2712 [inline] call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826 ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105 ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 Second to last potentially related work creation: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492 __call_rcu_common kernel/rcu/tree.c:2712 [inline] call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826 ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105 ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 The buggy address belongs to the object at ffff888123839d68 which belongs to the cache bfq_io_cq of size 1360 The buggy address is located 336 bytes inside of freed 1360-byte region [ffff888123839d68, ffff88812383a2b8) The buggy address belongs to the physical page: page:ffffea00048e0e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88812383f588 pfn:0x123838 head:ffffea00048e0e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x17ffffc0000a40(workingset\|slab\|head\|node=0\|zone=2\|lastcpupid=0x1fffff) page_type: 0xffffffff() raw: 0017ffffc0000a40 ffff88810588c200 ffffea00048ffa10 ffff888105889488 raw: ffff88812383f588 0000000000150006 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888123839d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888123839e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888123839e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888123839f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888123839f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:21:19 +01:00
Christoph Hellwig	9e8e62c4a8	block: initialize integrity buffer to zero before writing it to media commit 899ee2c3829c5ac14bfc7d3c4a5846c0b709b78f upstream. Metadata added by bio_integrity_prep is using plain kmalloc, which leads to random kernel memory being written media. For PI metadata this is limited to the app tag that isn't used by kernel generated metadata, but for non-PI metadata the entire buffer leaks kernel memory. Fix this by adding the __GFP_ZERO flag to allocations for writes. Fixes: 7ba1ba12eeef ("block: Block layer data integrity support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240613084839.1044015-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Shivani Agarwal <shivani.agarwal@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-23 23:20:59 +01:00
Christoph Hellwig	a5698c3795	block: remove the blk_flush_integrity call in blk_integrity_unregister [ Upstream commit e8bc14d116aeac8f0f133ec8d249acf4e0658da7 ] Now that there are no indirect calls for PI processing there is no way to dereference a NULL pointer here. Additionally drivers now always freeze the queue (or in case of stacking drivers use their internal equivalent) around changing the integrity profile. This is effectively a revert of commit 3df49967f6f1 ("block: flush the integrity workqueue in blk_integrity_unregister"). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240613084839.1044015-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-23 23:20:58 +01:00
Ksawlii	7fcc5fcd13	Revert "mm: apply init protection" This reverts commit `bcec04dde1`.	2024-11-19 18:15:13 +01:00
ztc1997	136bbfd757	block: Add default I/O scheduler option	2024-11-19 17:43:55 +01:00
Paolo Valente	fbbabdb3bc	block, bfq: use half slice_idle as a threshold to check short ttime The value of the I/O plugging (idling) timeout is used also as the think-time threshold to decide whether a process has a short think time. In this respect, a good value of this timeout for rotational drives is un the order of several ms. Yet, this is often too long a time interval to be effective as a think-time threshold. This commit mitigates this problem (by a lot, according to tests), by halving the threshold. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit b5f74ecacc3139ef873e69acc3aba28083ecc416) (cherry picked from commit b1511c438e8a5668e6be04ad9107d6695332756c) (cherry picked from commit 389992d9dc78340676248d0f01c7569b3db950ed) (cherry picked from commit 49919eface6f4391cda0e77bcaad3e2786cbbab3) (cherry picked from commit 87b015de51122ea9b5d9e56b846ae945db8444f0) (cherry picked from commit 6ada34cdc94c89e97926a2d001412ecc027e1392) (cherry picked from commit 2782bcc2919dd2a0a1d461d36c22338e67bc6327)	2024-11-19 17:43:46 +01:00
Paolo Valente	fe945719eb	block, bfq: increase time window for waker detection Tests on slower machines showed current window to be way too small. This commit increases it. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit ab1fb47e33dc7754a7593181ffe0742c7105ea9a) (cherry picked from commit 0d1663f1922c5f6fb3a4b3cc5a3a861c765a3704) (cherry picked from commit 85d9e1637a38d0cfdeba4e3847f1797dcd18da5d) (cherry picked from commit 6bd707bb9a60e2bf0e680a271208f6c82a331571) (cherry picked from commit 43755e08d048ccd6f3b2a3bbd34bea4a71c5bc12) (cherry picked from commit b1a8cce9e99277ce53da20ab603473ad6c3e95d1) (cherry picked from commit 74d27133a3261a296ddd98e9ff09d89bfab797bb)	2024-11-19 17:43:43 +01:00
Paolo Valente	7034a03ec0	block, bfq: do not raise non-default weights BFQ heuristics try to detect interactive I/O, and raise the weight of the queues containing such an I/O. Yet, if also the user changes the weight of a queue (i.e., the user changes the ioprio of the process associated with that queue), then it is most likely better to prevent BFQ heuristics from silently changing the same weight. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 91b896f65d32610d6d58af02170b15f8d37a7702) (cherry picked from commit cbbd2f045e60073978fe1b721c0953cd8762ecbb) (cherry picked from commit 88b650c71f7d0d30ac2fa215a139d7a48d069cd9) (cherry picked from commit 9a4725f0341c71a9b4f50f2d203f9740029e42e5) (cherry picked from commit a2c57345ffa5404cefd3d43e2fd4e4492ac7c6e0) (cherry picked from commit df56458ca85c681d163d879b832f868ed5044c8e) (cherry picked from commit dfc085aad98db2bcabd2c438fcd722a90303e6cb)	2024-11-19 17:43:40 +01:00
Paolo Valente	4b23f1e69b	block, bfq: do not expire a queue when it is the only busy one This commits preserves I/O-dispatch plugging for a special symmetric case that may suddenly turn into asymmetric: the case where only one bfq_queue, say bfqq, is busy. In this case, not expiring bfqq does not cause any harm to any other queues in terms of service guarantees. In contrast, it avoids the following unlucky sequence of events: (1) bfqq is expired, (2) a new queue with a lower weight than bfqq becomes busy (or more queues), (3) the new queue is served until a new request arrives for bfqq, (4) when bfqq is finally served, there are so many requests of the new queue in the drive that the pending requests for bfqq take a lot of time to be served. In particular, event (2) may case even already dispatched requests of bfqq to be delayed, inside the drive. So, to avoid this series of events, the scenario is preventively declared as asymmetric also if bfqq is the only busy queues. By doing so, I/O-dispatch plugging is performed for bfqq. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 2391d13ed484df1515f0025458e1f82317823fab) (cherry picked from commit 79827eb41d8fb0f838a2c592775a8e63caeb7c57) (cherry picked from commit 41720669259995fb7f064fc0f988c9d228750b37) (cherry picked from commit 07d273c955ea2c34a42f6de0f1e3f1bfb00c6ce1) (cherry picked from commit 8034c856b8fcafbef405eedddc12bb0625e52a42) (cherry picked from commit f49083d304bda30647196b550a109f528c8266dc) (cherry picked from commit 8a597f0ab5e7e83bfa426d071185c3d3ce5fa535)	2024-11-19 17:43:34 +01:00
Paolo Valente	5238084cd8	block, bfq: save also injection state on queue merging To prevent injection information from being lost on bfq_queue merging, also the amount of service that a bfq_queue receives must be saved and restored when the bfq_queue is merged and split, respectively. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 5a5436b98d5cd2714feaaa579cec49dd7f7057bb) (cherry picked from commit 9372e98dc77c7f2ebbb808a60abb01f30d70d0bc) (cherry picked from commit e6a5b66cfe56495f26182cfd2340e3336bb4b2b4) (cherry picked from commit c579a3634d163ed05cc4ac258411f03db969926e) (cherry picked from commit 359f87d07390f687634185b0dd9d6f106fb5afdd) (cherry picked from commit d1d1f1336ed77b83e98d26175e196b45a28958f4) (cherry picked from commit 0ff8068594640924e0cffe27d8b0273bb80d74ca)	2024-11-19 17:43:15 +01:00
Paolo Valente	0769622634	block, bfq: save also weight-raised service on queue merging To prevent weight-raising information from being lost on bfq_queue merging, also the amount of service that a bfq_queue receives must be saved and restored when the bfq_queue is merged and split, respectively. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit e673914d52f913584cc4c454dfcff2e8eb04533f) (cherry picked from commit 48f3cf9bb6ae73de3e8e6cad2e50c6e70a6cd33f) (cherry picked from commit d947cf3f8bcbcbe2dd8f5eec82e83a35198f874b) (cherry picked from commit 39b91f1f22265c70cdc48916ac694dad6c21c191) (cherry picked from commit 421c82648e46467d29dc0b5cd5522f00a026083d) (cherry picked from commit e9eecde7c67303c1dc87864c10c372019d609b0b) (cherry picked from commit 41d4c63679c36dc63b4cc9be301ec8d8d518d33f)	2024-11-19 17:43:10 +01:00
Paolo Valente	5267faf794	block, bfq: fix switch back from soft-rt weitgh-raising A bfq_queue may happen to be deemed as soft real-time while it is still enjoying interactive weight-raising. If this happens because of a false positive, then the bfq_queue is likely to loose its soft real-time status soon. Upon losing such a status, the bfq_queue must get back its interactive weight-raising, if its interactive period is not over yet. But this case is not handled. This commit corrects this error. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit d1f600fa4732dac36c71a03b790f0c829a076475) (cherry picked from commit db891a7d6aed6cc37d681d2bbf6c9bd697059281) (cherry picked from commit 647b877a9a8493df84a1d4abd94be089c8fed49b) (cherry picked from commit 7eda6de0bbbfa1d05b8888b697d9b7aeffe4d64e) (cherry picked from commit c1e076d9f4688c77dfa0f859060ae1f27a8d889e) (cherry picked from commit db0058abb7534aeb0abebe01c65659aa3886de78) (cherry picked from commit 40bc06529a2053ca0caf2053dd6f2a27bf7af916)	2024-11-19 17:42:58 +01:00
Paolo Valente	8b47ef547b	block, bfq: re-evaluate convenience of I/O plugging on rq arrivals Upon an I/O-dispatch attempt, BFQ may detect that it was better to plug I/O dispatch, and to wait for a new request to arrive for the currently in-service queue. But the arrival of a new request for an empty bfq_queue, and thus the switch from idle to busy of the bfq_queue, may cause the scenario to change, and make plugging no longer needed for service guarantees, or more convenient for throughput. In this case, keeping I/O-dispatch plugged would certainly lower throughput. To address this issue, this commit makes such a check, and stops plugging I/O if it is better to stop plugging I/O. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 7f1995c27b19060dbdff23442f375e3097c90707) (cherry picked from commit 12ec5a8ca2486d06f880d41751383c0d9549ba49) (cherry picked from commit 64c6efc5ccb01edf553487aff312c0b7110cb30f) (cherry picked from commit 3e04c1949f447a8166fa6d6343bd5332d8c12a4b) (cherry picked from commit 40a263c36cf2094311e8189b6e9173360a808b12) (cherry picked from commit 61a02ce46503671c747e550a13972ca8abaf5030) (cherry picked from commit 3707ff2d32dccd807b8e5e6885f07f3874c71180)	2024-11-19 17:42:55 +01:00
Jan Kara	b93af2c415	bfq: Use 'ttime' local variable Use local variable 'ttime' instead of dereferencing bfqq. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 28c6def009192b673f92ea357dfb535ba15e00a4) (cherry picked from commit bb2a213aa0a2b717c3a6e7848c6f82656d80897f) (cherry picked from commit 2e0cfffb9a6da88cb1a786fb95618bfa714fea32) (cherry picked from commit caff780963fdfda0ab456c24027298482d745b2f) (cherry picked from commit b893b660ea8e998b760d48faeed2834e483158ad) (cherry picked from commit 7e3d952af5fdcf6b02d01d55dbf658fbc2d67f41) (cherry picked from commit 033b49f66e3808fead9e65e7c9417f26d423374f)	2024-11-19 17:42:05 +01:00
Joseph Qi	fdcb87e105	block/bfq: update comments and default value in docs for fifo_expire Correct the comments since bfq_fifo_expire[0] is for async request, while bfq_fifo_expire[1] is for sync request. Also update docs, according the source code, the default fifo_expire_async is 250ms, and fifo_expire_sync is 125ms. Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 4168a8d27ed3a00f160e7f885c956f060d2a0741) (cherry picked from commit a31ff2eb7d7cfa8331e513bb282f304117f18a77) (cherry picked from commit a78637befaa4106f9858b3ad8e3273960d3de82b) (cherry picked from commit bd8e7d3845c7a3b602aee361c7e3d0b5764ce060) (cherry picked from commit a8543954accfadbb9a1cf1f64c6b3749ee3a629b) (cherry picked from commit 960981f44b77dcd0d4e786aaef72d39057ccfc03) (cherry picked from commit 50cfb4b6c1c2e4a3778f66510fee7a2e86e053f2)	2024-11-19 17:41:49 +01:00
Paolo Valente	8eb5a42575	block, bfq: always inject I/O of queues blocked by wakers Suppose that I/O dispatch is plugged, to wait for new I/O for the in-service bfq-queue, say bfqq. Suppose then that there is a further bfq_queue woken by bfqq, and that this woken queue has pending I/O. A woken queue does not steal bandwidth from bfqq, because it remains soon without I/O if bfqq is not served. So there is virtually no risk of loss of bandwidth for bfqq if this woken queue has I/O dispatched while bfqq is waiting for new I/O. In contrast, this extra I/O injection boosts throughput. This commit performs this extra injection. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 2ec5a5c48373d4bc2f0699f86507a65bf0b9df35) (cherry picked from commit 0750db9767232fc2e4850868e526f4b02ecfb247) (cherry picked from commit 8676f43249bbb0478a8b18bd87703da59902dbfd) (cherry picked from commit df655d250f253a2f8a6792569108f30a04b7b894) (cherry picked from commit d76168c1c3805a2c948e7ff60c8eb341e2ff0013) (cherry picked from commit f213ae4e575f8ed67ae065fe80d06dc957f0b068) (cherry picked from commit eb1ff3ab6d66081fbaf007c6cfc1a5e841719c0c)	2024-11-19 17:41:42 +01:00
Jan Kara	9abe5bf065	bfq: Provide helper to generate bfqq name Instead of having helper formating bfqq pid, provide a helper to generate full bfqq name as used in the traces. It saves some code duplication and will save more in the coming tracepoints. Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211125133645.27483-6-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 582f04e19ad7b41df993c669805e48a01bcd9c5b) (cherry picked from commit e030e88a4c2e220366f3db1af33d72d9638f93b5) (cherry picked from commit e925a5fdce15f914ec2386b03bf64242792acce0) (cherry picked from commit 9265a0e6952305932aa2b5caf2183387859dcfce) (cherry picked from commit 41794de36673c11faca8c57625dfa50b76edde20) (cherry picked from commit 5e830976b50a9f0a2c927b02f921f0d6ae796183) (cherry picked from commit b5344876556e4a62cac7905bf11ca7ccf8d16d6d)	2024-11-19 17:41:18 +01:00
Yahu Gao	1297c45dcc	block/bfq_wf2q: correct weight to ioprio The return value is ioprio * BFQ_WEIGHT_CONVERSION_COEFF or 0. What we want is ioprio or 0. Correct this by changing the calculation. Signed-off-by: Yahu Gao <gaoyahu19@gmail.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220107065859.25689-1-gaoyahu19@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit bcd2be763252f3a4d5fc4d6008d4d96c601ee74b) (cherry picked from commit 81806db867a17e49d37b1d556dd39f4da5227f56) (cherry picked from commit aed9dbfda208b30130c64bac55570e2f89084d2b) (cherry picked from commit 7158b54afec4b986d52cc646a5dffc30eac6dc19) (cherry picked from commit fb4f80f773e0fc89f372c7afda9c8e9794849f67) (cherry picked from commit 5ad409c78ed2bfca202490fa13f0a93c49f21382)	2024-11-19 17:40:48 +01:00
Jan Kara	c72c9473f5	blk: Fix lock inversion between ioc lock and bfqd lock Lockdep complains about lock inversion between ioc->lock and bfqd->lock: bfqd -> ioc: put_io_context+0x33/0x90 -> ioc->lock grabbed blk_mq_free_request+0x51/0x140 blk_put_request+0xe/0x10 blk_attempt_req_merge+0x1d/0x30 elv_attempt_insert_merge+0x56/0xa0 blk_mq_sched_try_insert_merge+0x4b/0x60 bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed blk_mq_sched_insert_requests+0xd6/0x2b0 blk_mq_flush_plug_list+0x154/0x280 blk_finish_plug+0x40/0x60 ext4_writepages+0x696/0x1320 do_writepages+0x1c/0x80 __filemap_fdatawrite_range+0xd7/0x120 sync_file_range+0xac/0xf0 ioc->bfqd: bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed put_io_context_active+0x78/0xb0 -> ioc->lock grabbed exit_io_context+0x48/0x50 do_exit+0x7e9/0xdd0 do_group_exit+0x54/0xc0 To avoid this inversion we change blk_mq_sched_try_insert_merge() to not free the merged request but rather leave that upto the caller similarly to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure to free all the merged requests after dropping bfqd->lock. Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit fd2ef39cc9a6b9c4c41864ac506906c52f94b06a) (cherry picked from commit 786e392c4a7bd2559bdc1a1c6ac28d8b612a0735) (cherry picked from commit aa8e3e1451bde73dff60f1e5110b6a3cb810e35b) (cherry picked from commit 4deef6abb13a82b148c583d9ab37374c876fe4c2) (cherry picked from commit 1988f864ec1c494bb54e5b9df1611195f6d923f2) (cherry picked from commit 9dc0074b0dd8960f9e06dc1494855493ff53eb68) (cherry picked from commit c937983724111bb4526e34da0d5c6c8aea1902af)	2024-11-19 17:40:26 +01:00
Johannes Weiner	90c0c9aa4a	cgroup: rstat: punt root-level optimization to individual controllers Current users of the rstat code can source root-level statistics from the native counters of their respective subsystem, allowing them to forego aggregation at the root level. This optimization is currently implemented inside the generic rstat code, which doesn't track the root cgroup and doesn't invoke the subsystem flush callbacks on it. However, the memory controller cannot do this optimization, because cgroup1 breaks out memory specifically for the local level, including at the root level. In preparation for the memory controller switching to rstat, move the optimization from rstat core to the controllers. Afterwards, rstat will always track the root cgroup for changes and invoke the subsystem callbacks on it; and it's up to the subsystem to special-case and skip aggregation of the root cgroup if it can source this information through other, cheaper means. This is the case for the io controller and the cgroup base stats. In their respective flush callbacks, check whether the parent is the root cgroup, and if so, skip the unnecessary upward propagation. The extra cost of tracking the root cgroup is negligible: on stat changes, we actually remove a branch that checks for the root. The queueing for a flush touches only per-cpu data, and only the first stat change since a flush requires a (per-cpu) lock. Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit dc26532aed0ab25c0801a34640d1f3b9b9098a48) (cherry picked from commit 69da183fcd0112af130879a1c93113a941e2241b) (cherry picked from commit ddf1013871482b246147e71a04c865c1be5cf74d) (cherry picked from commit 30fcd52e18dd1d508b1b22f7c660ac22de734f67) (cherry picked from commit 19c9a1b9d9ae9a4f359deaf89101f9013254f43d) (cherry picked from commit 0b4286aea9bb0a6ea6acb723f8396e476044190b)	2024-11-19 17:40:21 +01:00
Nahuel Gómez	a4d33f6631	block: ssg-iosched: adapt to new patches ../block/ssg-iosched.c:684:41: error: too few arguments to function call, expected 3, have 2 684 \| if (blk_mq_sched_try_insert_merge(q, rq)) \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ../block/blk-mq-sched.h:15:6: note: 'blk_mq_sched_try_insert_merge' declared here 15 \| bool blk_mq_sched_try_insert_merge(struct request_queue q, struct request rq, \| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 16 \| struct list_head *free); \| ~~~~~~~~~~~~~~~~~~~~~~ 1 error generated. Signed-off-by: Nahuel Gómez <nahuelgomez329@gmail.com>	2024-11-19 17:40:09 +01:00
Adam W. Willis	bcec04dde1	mm: apply init protection Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com> Change-Id: I1a1928fec9efeb29203a94644388c3ca48e7d96e [TogoFire]: adapt to k5.4. Signed-off-by: TogoFire <togofire@mailfence.com>	2024-11-19 17:39:06 +01:00
Justin Stitt	bf03d87826	block/ioctl: prefer different overflow check [ Upstream commit ccb326b5f9e623eb7f130fbbf2505ec0e2dcaff9 ] Running syzkaller with the newly reintroduced signed integer overflow sanitizer shows this report: [ 62.982337] ------------[ cut here ]------------ [ 62.985692] cgroup: Invalid name [ 62.986211] UBSAN: signed-integer-overflow in ../block/ioctl.c:36:46 [ 62.989370] 9pnet_fd: p9_fd_create_tcp (7343): problem connecting socket to 127.0.0.1 [ 62.992992] 9223372036854775807 + 4095 cannot be represented in type 'long long' [ 62.997827] 9pnet_fd: p9_fd_create_tcp (7345): problem connecting socket to 127.0.0.1 [ 62.999369] random: crng reseeded on system resumption [ 63.000634] GUP no longer grows the stack in syz-executor.2 (7353): 20002000-20003000 (20001000) [ 63.000668] CPU: 0 PID: 7353 Comm: syz-executor.2 Not tainted 6.8.0-rc2-00035-gb3ef86b5a957 #1 [ 63.000677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 63.000682] Call Trace: [ 63.000686] <TASK> [ 63.000731] dump_stack_lvl+0x93/0xd0 [ 63.000919] __get_user_pages+0x903/0xd30 [ 63.001030] __gup_longterm_locked+0x153e/0x1ba0 [ 63.001041] ? _raw_read_unlock_irqrestore+0x17/0x50 [ 63.001072] ? try_get_folio+0x29c/0x2d0 [ 63.001083] internal_get_user_pages_fast+0x1119/0x1530 [ 63.001109] iov_iter_extract_pages+0x23b/0x580 [ 63.001206] bio_iov_iter_get_pages+0x4de/0x1220 [ 63.001235] iomap_dio_bio_iter+0x9b6/0x1410 [ 63.001297] __iomap_dio_rw+0xab4/0x1810 [ 63.001316] iomap_dio_rw+0x45/0xa0 [ 63.001328] ext4_file_write_iter+0xdde/0x1390 [ 63.001372] vfs_write+0x599/0xbd0 [ 63.001394] ksys_write+0xc8/0x190 [ 63.001403] do_syscall_64+0xd4/0x1b0 [ 63.001421] ? arch_exit_to_user_mode_prepare+0x3a/0x60 [ 63.001479] entry_SYSCALL_64_after_hwframe+0x6f/0x77 [ 63.001535] RIP: 0033:0x7f7fd3ebf539 [ 63.001551] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 63.001562] RSP: 002b:00007f7fd32570c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 63.001584] RAX: ffffffffffffffda RBX: 00007f7fd3ff3f80 RCX: 00007f7fd3ebf539 [ 63.001590] RDX: 4db6d1e4f7e43360 RSI: 0000000020000000 RDI: 0000000000000004 [ 63.001595] RBP: 00007f7fd3f1e496 R08: 0000000000000000 R09: 0000000000000000 [ 63.001599] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 63.001604] R13: 0000000000000006 R14: 00007f7fd3ff3f80 R15: 00007ffd415ad2b8 ... [ 63.018142] ---[ end trace ]--- Historically, the signed integer overflow sanitizer did not work in the kernel due to its interaction with `-fwrapv` but this has since been changed [1] in the newest version of Clang; It was re-enabled in the kernel with Commit 557f8c582a9ba8ab ("ubsan: Reintroduce signed overflow sanitizer"). Let's rework this overflow checking logic to not actually perform an overflow during the check itself, thus avoiding the UBSAN splat. [1]: https://github.com/llvm/llvm-project/pull/82432 Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240507-b4-sio-block-ioctl-v3-1-ba0c2b32275e@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-19 14:19:06 +01:00
Rik van Riel	bba9ddbe67	blk-iocost: avoid out of bounds shift [ Upstream commit beaa51b36012fad5a4d3c18b88a617aea7a9b96d ] UBSAN catches undefined behavior in blk-iocost, where sometimes iocg->delay is shifted right by a number that is too large, resulting in undefined behavior on some architectures. [ 186.556576] ------------[ cut here ]------------ UBSAN: shift-out-of-bounds in block/blk-iocost.c:1366:23 shift exponent 64 is too large for 64-bit type 'u64' (aka 'unsigned long long') CPU: 16 PID: 0 Comm: swapper/16 Tainted: G S E N 6.9.0-0_fbk700_debug_rc2_kbuilder_0_gc85af715cac0 #1 Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A23 12/08/2020 Call Trace: <IRQ> dump_stack_lvl+0x8f/0xe0 __ubsan_handle_shift_out_of_bounds+0x22c/0x280 iocg_kick_delay+0x30b/0x310 ioc_timer_fn+0x2fb/0x1f80 __run_timer_base+0x1b6/0x250 ... Avoid that undefined behavior by simply taking the "delay = 0" branch if the shift is too large. I am not sure what the symptoms of an undefined value delay will be, but I suspect it could be more than a little annoying to debug. Signed-off-by: Rik van Riel <riel@surriel.com> Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240404123253.0f58010f@imladris.surriel.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-19 11:32:44 +01:00
Roman Smirnov	531f682d48	block: prevent division by zero in blk_rq_stat_sum() [ Upstream commit 93f52fbeaf4b676b21acfe42a5152620e6770d02 ] The expression dst->nr_samples + src->nr_samples may have zero value on overflow. It is necessary to add a check to avoid division by zero. Found by Linux Verification Center (linuxtesting.org) with Svace. Signed-off-by: Roman Smirnov <r.smirnov@omp.ru> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Link: https://lore.kernel.org/r/20240305134509.23108-1-r.smirnov@omp.ru Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-19 09:23:14 +01:00
Min Li	f7b14efbca	block: add check that partition length needs to be aligned with block size commit 6f64f866aa1ae6975c95d805ed51d7e9433a0016 upstream. Before calling add partition or resize partition, there is no check on whether the length is aligned with the logical block size. If the logical block size of the disk is larger than 512 bytes, then the partition size maybe not the multiple of the logical block size, and when the last sector is read, bio_truncate() will adjust the bio size, resulting in an IO error if the size of the read command is smaller than the logical block size.If integrity data is supported, this will also result in a null pointer dereference when calling bio_integrity_free. Cc: <stable@vger.kernel.org> Signed-off-by: Min Li <min15.li@samsung.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230629142517.121241-1-min15.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ashwin Dayanand Kamat <ashwin.kamat@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-11-19 09:22:46 +01:00
Damien Le Moal	8054588c73	block: Clear zone limits for a non-zoned stacked queue [ Upstream commit c8f6f88d25929ad2f290b428efcae3b526f3eab0 ] Device mapper may create a non-zoned mapped device out of a zoned device (e.g., the dm-zoned target). In such case, some queue limit such as the max_zone_append_sectors and zone_write_granularity endup being non zero values for a block device that is not zoned. Avoid this by clearing these limits in blk_stack_limits() when the stacked zoned limit is false. Fixes: 3093a479727b ("block: inherit the zoned characteristics in blk_stack_limits") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240222131724.1803520-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-19 09:22:16 +01:00
Damien Le Moal	08e96cffc0	block: introduce zone_write_granularity limit [ Upstream commit a805a4fa4fa376bbc145762bb8b09caa2fa8af48 ] Per ZBC and ZAC specifications, host-managed SMR hard-disks mandate that all writes into sequential write required zones be aligned to the device physical block size. However, NVMe ZNS does not have this constraint and allows write operations into sequential zones to be aligned to the device logical block size. This inconsistency does not help with software portability across device types. To solve this, introduce the zone_write_granularity queue limit to indicate the alignment constraint, in bytes, of write operations into zones of a zoned block device. This new limit is exported as a read-only sysfs queue attribute and the helper blk_queue_zone_write_granularity() introduced for drivers to set this limit. The function blk_queue_set_zoned() is modified to set this new limit to the device logical block size by default. NVMe ZNS devices as well as zoned nullb devices use this default value as is. The scsi disk driver is modified to execute the blk_queue_zone_write_granularity() helper to set the zone write granularity of host-managed SMR disks to the disk physical block size. The accessor functions queue_zone_write_granularity() and bdev_zone_write_granularity() are also introduced. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: c8f6f88d2592 ("block: Clear zone limits for a non-zoned stacked queue") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-19 09:22:16 +01:00

1 2

65 commits