8 commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Neal Cardwell
|
0295a863b0 |
tcp: fix to allow timestamp undo if no retransmits were sent
[ Upstream commit e37ab7373696e650d3b6262a5b882aadad69bb9e ] Fix the TCP loss recovery undo logic in tcp_packet_delayed() so that it can trigger undo even if TSQ prevents a fast recovery episode from reaching tcp_retransmit_skb(). Geumhwan Yu <geumhwan.yu@samsung.com> recently reported that after this commit from 2019: commit bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit") ...and before this fix we could have buggy scenarios like the following: + Due to reordering, a TCP connection receives some SACKs and enters a spurious fast recovery. + TSQ prevents all invocations of tcp_retransmit_skb(), because many skbs are queued in lower layers of the sending machine's network stack; thus tp->retrans_stamp remains 0. + The connection receives a TCP timestamp ECR value echoing a timestamp before the fast recovery, indicating that the fast recovery was spurious. + The connection fails to undo the spurious fast recovery because tp->retrans_stamp is 0, and thus tcp_packet_delayed() returns false, due to the new logic in the 2019 commit: commit bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit") This fix tweaks the logic to be more similar to the tcp_packet_delayed() logic before bc9f38c8328e, except that we take care not to be fooled by the FLAG_SYN_ACKED code path zeroing out tp->retrans_stamp (the bug noted and fixed by Yuchung in bc9f38c8328e). Note that this returns the high-level behavior of tcp_packet_delayed() to again match the comment for the function, which says: "Nothing was retransmitted or returned timestamp is less than timestamp of the first retransmission." Note that this comment is in the original 2005-04-16 Linux git commit, so this is evidently long-standing behavior. Fixes: bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit") Reported-by: Geumhwan Yu <geumhwan.yu@samsung.com> Diagnosed-by: Geumhwan Yu <geumhwan.yu@samsung.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241001200517.2756803-2-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Neal Cardwell
|
7d98eb18be |
tcp: fix incorrect undo caused by DSACK of TLP retransmit
[ Upstream commit 0ec986ed7bab6801faed1440e8839dcc710331ff ] Loss recovery undo_retrans bookkeeping had a long-standing bug where a DSACK from a spurious TLP retransmit packet could cause an erroneous undo of a fast recovery or RTO recovery that repaired a single really-lost packet (in a sequence range outside that of the TLP retransmit). Basically, because the loss recovery state machine didn't account for the fact that it sent a TLP retransmit, the DSACK for the TLP retransmit could erroneously be implicitly be interpreted as corresponding to the normal fast recovery or RTO recovery retransmit that plugged a real hole, thus resulting in an improper undo. For example, consider the following buggy scenario where there is a real packet loss but the congestion control response is improperly undone because of this bug: + send packets P1, P2, P3, P4 + P1 is really lost + send TLP retransmit of P4 + receive SACK for original P2, P3, P4 + enter fast recovery, fast-retransmit P1, increment undo_retrans to 1 + receive DSACK for TLP P4, decrement undo_retrans to 0, undo (bug!) + receive cumulative ACK for P1-P4 (fast retransmit plugged real hole) The fix: when we initialize undo machinery in tcp_init_undo(), if there is a TLP retransmit in flight, then increment tp->undo_retrans so that we make sure that we receive a DSACK corresponding to the TLP retransmit, as well as DSACKs for all later normal retransmits, before triggering a loss recovery undo. Note that we also have to move the line that clears tp->tlp_high_seq for RTO recovery, so that upon RTO we remember the tp->tlp_high_seq value until tcp_init_undo() and clear it only afterward. Also note that the bug dates back to the original 2013 TLP implementation, commit 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)"). However, this patch will only compile and work correctly with kernels that have tp->tlp_retrans, which was added only in v5.8 in 2020 in commit 76be93fc0702 ("tcp: allow at most one TLP probe per flight"). So we associate this fix with that later commit. Fixes: 76be93fc0702 ("tcp: allow at most one TLP probe per flight") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Kevin Yang <yyd@google.com> Link: https://patch.msgid.link/20240703171246.1739561-1-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Neal Cardwell
|
4cf42b34a9 |
UPSTREAM: tcp: fix DSACK undo in fast recovery to call tcp_try_to_open()
[ Upstream commit a6458ab7fd4f427d4f6f54380453ad255b7fde83 ] In some production workloads we noticed that connections could sometimes close extremely prematurely with ETIMEDOUT after transmitting only 1 TLP and RTO retransmission (when we would normally expect roughly tcp_retries2 = TCP_RETR2 = 15 RTOs before a connection closes with ETIMEDOUT). From tracing we determined that these workloads can suffer from a scenario where in fast recovery, after some retransmits, a DSACK undo can happen at a point where the scoreboard is totally clear (we have retrans_out == sacked_out == lost_out == 0). In such cases, calling tcp_try_keep_open() means that we do not execute any code path that clears tp->retrans_stamp to 0. That means that tp->retrans_stamp can remain erroneously set to the start time of the undone fast recovery, even after the fast recovery is undone. If minutes or hours elapse, and then a TLP/RTO/RTO sequence occurs, then the start_ts value in retransmits_timed_out() (which is from tp->retrans_stamp) will be erroneously ancient (left over from the fast recovery undone via DSACKs). Thus this ancient tp->retrans_stamp value can cause the connection to die very prematurely with ETIMEDOUT via tcp_write_err(). The fix: we change DSACK undo in fast recovery (TCP_CA_Recovery) to call tcp_try_to_open() instead of tcp_try_keep_open(). This ensures that if no retransmits are in flight at the time of DSACK undo in fast recovery then we properly zero retrans_stamp. Note that calling tcp_try_to_open() is more consistent with other loss recovery behavior, since normal fast recovery (CA_Recovery) and RTO recovery (CA_Loss) both normally end when tp->snd_una meets or exceeds tp->high_seq and then in tcp_fastretrans_alert() the "default" switch case executes tcp_try_to_open(). Also note that by inspection this change to call tcp_try_to_open() implies at least one other nice bug fix, where now an ECE-marked DSACK that causes an undo will properly invoke tcp_enter_cwr() rather than ignoring the ECE mark. Fixes: c7d9d6a185a7 ("tcp: undo on DSACK during recovery") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Eric Dumazet
|
2785cbe7c9 |
tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets
[ Upstream commit 94062790aedb505bdda209b10bea47b294d6394f ] TCP_SYN_RECV state is really special, it is only used by cross-syn connections, mostly used by fuzzers. In the following crash [1], syzbot managed to trigger a divide by zero in tcp_rcv_space_adjust() A socket makes the following state transitions, without ever calling tcp_init_transfer(), meaning tcp_init_buffer_space() is also not called. TCP_CLOSE connect() TCP_SYN_SENT TCP_SYN_RECV shutdown() -> tcp_shutdown(sk, SEND_SHUTDOWN) TCP_FIN_WAIT1 To fix this issue, change tcp_shutdown() to not perform a TCP_SYN_RECV -> TCP_FIN_WAIT1 transition, which makes no sense anyway. When tcp_rcv_state_process() later changes socket state from TCP_SYN_RECV to TCP_ESTABLISH, then look at sk->sk_shutdown to finally enter TCP_FIN_WAIT1 state, and send a FIN packet from a sane socket state. This means tcp_send_fin() can now be called from BH context, and must use GFP_ATOMIC allocations. [1] divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 1 PID: 5084 Comm: syz-executor358 Not tainted 6.9.0-rc6-syzkaller-00022-g98369dccd2f8 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 RIP: 0010:tcp_rcv_space_adjust+0x2df/0x890 net/ipv4/tcp_input.c:767 Code: e3 04 4c 01 eb 48 8b 44 24 38 0f b6 04 10 84 c0 49 89 d5 0f 85 a5 03 00 00 41 8b 8e c8 09 00 00 89 e8 29 c8 48 0f af c3 31 d2 <48> f7 f1 48 8d 1c 43 49 8d 96 76 08 00 00 48 89 d0 48 c1 e8 03 48 RSP: 0018:ffffc900031ef3f0 EFLAGS: 00010246 RAX: 0c677a10441f8f42 RBX: 000000004fb95e7e RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000027d4b11f R08: ffffffff89e535a4 R09: 1ffffffff25e6ab7 R10: dffffc0000000000 R11: ffffffff8135e920 R12: ffff88802a9f8d30 R13: dffffc0000000000 R14: ffff88802a9f8d00 R15: 1ffff1100553f2da FS: 00005555775c0380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1155bf2304 CR3: 000000002b9f2000 CR4: 0000000000350ef0 Call Trace: <TASK> tcp_recvmsg_locked+0x106d/0x25a0 net/ipv4/tcp.c:2513 tcp_recvmsg+0x25d/0x920 net/ipv4/tcp.c:2578 inet6_recvmsg+0x16a/0x730 net/ipv6/af_inet6.c:680 sock_recvmsg_nosec net/socket.c:1046 [inline] sock_recvmsg+0x109/0x280 net/socket.c:1068 ____sys_recvmsg+0x1db/0x470 net/socket.c:2803 ___sys_recvmsg net/socket.c:2845 [inline] do_recvmmsg+0x474/0xae0 net/socket.c:2939 __sys_recvmmsg net/socket.c:3018 [inline] __do_sys_recvmmsg net/socket.c:3041 [inline] __se_sys_recvmmsg net/socket.c:3034 [inline] __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7faeb6363db9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffcc1997168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007faeb6363db9 RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000005 RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000001c R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20240501125448.896529-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Eric Dumazet
|
d1fdbc4379 |
tcp: do not accept ACK of bytes we never sent
[ Upstream commit 3d501dd326fb1c73f1b8206d4c6e1d7b15c07e27 ] This patch is based on a detailed report and ideas from Yepeng Pan and Christian Rossow. ACK seq validation is currently following RFC 5961 5.2 guidelines: The ACK value is considered acceptable only if it is in the range of ((SND.UNA - MAX.SND.WND) <= SEG.ACK <= SND.NXT). All incoming segments whose ACK value doesn't satisfy the above condition MUST be discarded and an ACK sent back. It needs to be noted that RFC 793 on page 72 (fifth check) says: "If the ACK is a duplicate (SEG.ACK < SND.UNA), it can be ignored. If the ACK acknowledges something not yet sent (SEG.ACK > SND.NXT) then send an ACK, drop the segment, and return". The "ignored" above implies that the processing of the incoming data segment continues, which means the ACK value is treated as acceptable. This mitigation makes the ACK check more stringent since any ACK < SND.UNA wouldn't be accepted, instead only ACKs that are in the range ((SND.UNA - MAX.SND.WND) <= SEG.ACK <= SND.NXT) get through. This can be refined for new (and possibly spoofed) flows, by not accepting ACK for bytes that were never sent. This greatly improves TCP security at a little cost. I added a Fixes: tag to make sure this patch will reach stable trees, even if the 'blamed' patch was adhering to the RFC. tp->bytes_acked was added in linux-4.2 Following packetdrill test (courtesy of Yepeng Pan) shows the issue at hand: 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1024) = 0 // ---------------- Handshake ------------------- // // when window scale is set to 14 the window size can be extended to // 65535 * (2^14) = 1073725440. Linux would accept an ACK packet // with ack number in (Server_ISN+1-1073725440. Server_ISN+1) // ,though this ack number acknowledges some data never // sent by the server. +0 < S 0:0(0) win 65535 <mss 1400,nop,wscale 14> +0 > S. 0:0(0) ack 1 <...> +0 < . 1:1(0) ack 1 win 65535 +0 accept(3, ..., ...) = 4 // For the established connection, we send an ACK packet, // the ack packet uses ack number 1 - 1073725300 + 2^32, // where 2^32 is used to wrap around. // Note: we used 1073725300 instead of 1073725440 to avoid possible // edge cases. // 1 - 1073725300 + 2^32 = 3221241997 // Oops, old kernels happily accept this packet. +0 < . 1:1001(1000) ack 3221241997 win 65535 // After the kernel fix the following will be replaced by a challenge ACK, // and prior malicious frame would be dropped. +0 > . 1:1(0) ack 1001 Fixes: 354e4aa391ed ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yepeng Pan <yepeng.pan@cispa.de> Reported-by: Christian Rossow <rossow@cispa.de> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20231205161841.2702925-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Aananth V
|
3aca5ccfb5 |
tcp: call tcp_try_undo_recovery when an RTOd TFO SYNACK is ACKed
[ Upstream commit e326578a21414738de45f77badd332fb00bd0f58 ] For passive TCP Fast Open sockets that had SYN/ACK timeout and did not send more data in SYN_RECV, upon receiving the final ACK in 3WHS, the congestion state may awkwardly stay in CA_Loss mode unless the CA state was undone due to TCP timestamp checks. However, if tcp_rcv_synrecv_state_fastopen() decides not to undo, then we should enter CA_Open, because at that point we have received an ACK covering the retransmitted SYNACKs. Currently, the icsk_ca_state is only set to CA_Open after we receive an ACK for a data-packet. This is because tcp_ack does not call tcp_fastretrans_alert (and tcp_process_loss) if !prior_packets Note that tcp_process_loss() calls tcp_try_undo_recovery(), so having tcp_rcv_synrecv_state_fastopen() decide that if we're in CA_Loss we should call tcp_try_undo_recovery() is consistent with that, and low risk. Fixes: dad8cea7add9 ("tcp: fix TFO SYNACK undo to avoid double-timestamp-undo") Signed-off-by: Aananth V <aananthv@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Fred Chen
|
f7b4fe9fca |
tcp: fix wrong RTO timeout when received SACK reneging
[ Upstream commit d2a0fc372aca561556e765d0a9ec365c7c12f0ad ] This commit fix wrong RTO timeout when received SACK reneging. When an ACK arrived pointing to a SACK reneging, tcp_check_sack_reneging() will rearm the RTO timer for min(1/2*srtt, 10ms) into to the future. But since the commit 62d9f1a6945b ("tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN") merged, the tcp_set_xmit_timer() is moved after tcp_fastretrans_alert()(which do the SACK reneging check), so the RTO timeout will be overwrited by tcp_set_xmit_timer() with icsk_rto instead of 1/2*srtt. Here is a packetdrill script to check this bug: 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 // simulate srtt to 100ms +0 < S 0:0(0) win 32792 <mss 1000, sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> +.1 < . 1:1(0) ack 1 win 1024 +0 accept(3, ..., ...) = 4 +0 write(4, ..., 10000) = 10000 +0 > P. 1:10001(10000) ack 1 // inject sack +.1 < . 1:1(0) ack 1 win 257 <sack 1001:10001,nop,nop> +0 > . 1:1001(1000) ack 1 // inject sack reneging +.1 < . 1:1(0) ack 1001 win 257 <sack 9001:10001,nop,nop> // we expect rto fired in 1/2*srtt (50ms) +.05 > . 1001:2001(1000) ack 1 This fix remove the FLAG_SET_XMIT_TIMER from ack_flag when tcp_check_sack_reneging() set RTO timer with 1/2*srtt to avoid being overwrited later. Fixes: 62d9f1a6945b ("tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN") Signed-off-by: Fred Chen <fred.chenchen03@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Gabriel2392
|
7ed7ee9edf | Import A536BXXU9EXDC |