[Bug 3898] New: Local port forwarding seemingly stalls with channel remote_window reaching 0
bugzilla-daemon at mindrot.org
Sat Nov 29 06:04:51 AEDT 2025
https://bugzilla.mindrot.org/show_bug.cgi?id=3898
Bug ID: 3898
Summary: Local port forwarding seemingly stalls with channel
remote_window reaching 0
Product: Portable OpenSSH
Version: 9.2p1
Hardware: All
OS: Linux
Status: NEW
Severity: minor
Priority: P5
Component: ssh
Assignee: unassigned-bugs at mindrot.org
Reporter: keith.zeto at proton.me
Hi,
I'm observing what looks like a race condition in SSH local port
forwarding (-L) that causes connections to stall and time out. The issue
seems to affect low-latency environments with moderate response sizes
(10-70KB). My use case is SSH tunneling through a bastion host to reach
a Kubernetes control plane on a private network, and we're frequently,
but intermittently, seeing failures on kubectl commands through the
established tunnel. The tunnel stalls at random after 5-30 requests. The
ssh channel appears to still have data buffered while the localhost
connection sits in CLOSE_WAIT, so it seems ssh received the client's FIN
but still has unprocessed data. Ultimately this causes the client
(kubectl, helm, etc.) to time out.
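Side note: the "unprocessed data" part should show up as a non-zero
Recv-Q in ss -tn while the local connection sits in CLOSE_WAIT. The
snippet below is an illustrative, stand-alone check of the same thing
via ioctl(FIONREAD) on a socketpair (toy code, nothing taken from ssh):

/*
 * Toy illustration only (not OpenSSH code): show how unread bytes sit
 * in a socket's kernel receive queue when the application never reads
 * them. FIONREAD reports the number of bytes pending.
 */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
        perror("socketpair");
        return 1;
    }

    /* Pretend to be the peer: queue a "response" that nobody reads. */
    const char resp[] = "HTTP/1.1 200 OK\r\n\r\n...";
    if (write(sv[1], resp, sizeof(resp) - 1) == -1) {
        perror("write");
        return 1;
    }

    /* FIONREAD (aka SIOCINQ on TCP sockets) = bytes queued but unread. */
    int pending = 0;
    if (ioctl(sv[0], FIONREAD, &pending) == -1) {
        perror("ioctl(FIONREAD)");
        return 1;
    }
    printf("%d bytes queued in the kernel, never read by the application\n",
           pending);

    close(sv[0]);
    close(sv[1]);
    return 0;
}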
It seems like channel_pre_open() excludes the channel's fd from ppoll()
when remote_window == 0, and that this creates a deadlock: the window
can't be adjusted if the channel isn't being polled (but take this with
a grain of salt, I have no idea what I'm really talking about here).
I could also be doing something stupid on my end, so I'm wondering
whether anyone has hit a similar issue before, or whether this is
indicative of some sort of bug, since I've hit a bit of a wall with it.
I've also tried tweaking various TCP connection settings extensively, to
no avail.
Here are some additional details:
Reproduction:
ssh -f -N -L 6443:kubernetes-api:443 user@bastion
for i in $(seq 1 100); do
    kubectl get pods -A && echo "OK $i"
done
Expected: All 100 requests succeed
Actual: Fails after 5-30 iterations with connection timeout
Observations:
- Lower-latency connectivity seems to fail faster; we don't see this
issue when the client is in a different region from the bastion.
- Running strace on ssh pretty much makes the bug disappear (probably
by slowing down the event loop?)
- Running against localhost through the bastion works reliably (probably
also due to the latency difference?)
VERSION AND PLATFORM
--------------------
Client: OpenSSH_9.6p1 Ubuntu-3ubuntu13.5, OpenSSL 3.0.13
Server: OpenSSH_9.2p1 Debian-2+deb12u3, OpenSSL 3.0.14
Platform: Ubuntu 24.04 in Firecracker microVM (client), Debian 12
(server)
Network: ~2ms RTT through GCP IAP tunnel
ROOT CAUSE ANALYSIS
-------------------
Since strace seemed to mask the issue, I used bpftrace to track the
ppoll() calls instead, which gave a rough idea of where the problem was.
In channel_pre_open() (channels.c ~line 1333), there's a check:
if (c->istate == CHAN_INPUT_OPEN &&
    c->remote_window > 0 &&                      // <-- Potential problem?
    sshbuf_len(c->input) < c->remote_window &&
    sshbuf_check_reserve(c->input, CHAN_RBUF) == 0)
        c->io_want |= SSH_CHAN_IO_RFD;
When remote_window reaches 0, the channel's read fd is excluded from
ppoll() entirely. This creates a deadlock:
- SSH won't read because the window is 0
- The window can't increase because SSH isn't processing incoming data
- The response sits unread in the kernel buffer until the client times out
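To convince myself the gating itself behaves the way I think it does, I
put together a tiny standalone toy (illustrative only, not OpenSSH code;
the "window" variable is a made-up stand-in for remote_window) that
mirrors the check above: POLLIN is only requested on the data fd while
the window is non-zero, so with the window at 0 the queued response is
simply never read:

/* Toy illustration only - not OpenSSH code. A "window" gates whether we
 * ask poll() for POLLIN on the data fd, mimicking the remote_window > 0
 * check quoted above. With window == 0 the fd is never polled for reads,
 * so the peer's data just sits in the kernel buffer. */
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return 1;

    /* The "peer" sends a response that is waiting to be read. */
    if (write(sv[1], "response", 8) == -1)
        return 1;

    unsigned int window = 0;                 /* pretend the window hit 0 */
    struct pollfd pfd = { .fd = sv[0], .events = 0, .revents = 0 };

    if (window > 0)                          /* the gate in question */
        pfd.events |= POLLIN;

    /* Nothing is requested for sv[0], so poll() just times out even
     * though 8 bytes are readable; the data stays queued in the kernel. */
    int n = poll(&pfd, 1, 200);
    printf("poll() returned %d (revents=0x%x) while data is still queued\n",
           n, pfd.revents);

    close(sv[0]);
    close(sv[1]);
    return 0;
}

Compiled and run as-is, this prints poll() returning 0 while the 8-byte
"response" is still sitting in the kernel buffer.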
BPFTRACE
--------
Trace captured at failure point (timestamps in nanoseconds):
450966193 ppoll_enter nfds=5 fds=[(3,0x1),(3,0x0),(4,0x1),(5,0x1),(6,0x1)]
536186178 ppoll_exit ret=1    # 85ms later, 1 fd ready
536198150 read(6, 32768)      # Reads from LOCAL socket, not the SSH channel
536200264 read_ret=0          # EOF - client gave up
The ppoll entry (3,0x0) shows the channel fd with no events requested.
When ppoll returns, SSH reads fd 6 (the local socket) and gets EOF
because the client has already timed out; the response data on the SSH
channel was never read.
The channel fd ended up with no events set in the ppoll array because
remote_window was 0 at the moment channel_pre_open() ran.
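In case the hex masks are hard to read at a glance: 0x1 is POLLIN and
0x0 means no events requested. Below is a small throwaway decoder (plain
C, nothing to do with ssh or bpftrace) for reading the masks in the
trace above:

/* Throwaway helper to decode the hex event masks in the bpftrace output,
 * e.g. 0x1 = POLLIN. Not part of ssh or bpftrace, just for reading traces. */
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>

static void decode(short events)
{
    printf("0x%x =", (unsigned short)events);
    if (events & POLLIN)   printf(" POLLIN");
    if (events & POLLOUT)  printf(" POLLOUT");
    if (events & POLLERR)  printf(" POLLERR");
    if (events & POLLHUP)  printf(" POLLHUP");
    if (events & POLLNVAL) printf(" POLLNVAL");
    if (events == 0)       printf(" (none requested)");
    printf("\n");
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        decode((short)strtol(argv[i], NULL, 16));
    return 0;
}

Running it as "./pollmask 0x1 0x0" prints POLLIN for the first mask and
"(none requested)" for the second.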
bpftrace command used (SSH_PID is the pid of the forwarding ssh client):
bpftrace -e '
tracepoint:syscalls:sys_enter_ppoll /pid == '$SSH_PID'/ {
    $fds = (struct pollfd *)args->ufds;
    $n = args->nfds;
    printf("%d ppoll_enter nfds=%d fds=[", nsecs, $n);
    if ($n > 0) { printf("(%d,0x%x)", $fds[0].fd, $fds[0].events); }
    if ($n > 1) { printf(",(%d,0x%x)", $fds[1].fd, $fds[1].events); }
    if ($n > 2) { printf(",(%d,0x%x)", $fds[2].fd, $fds[2].events); }
    if ($n > 3) { printf(",(%d,0x%x)", $fds[3].fd, $fds[3].events); }
    if ($n > 4) { printf(",(%d,0x%x)", $fds[4].fd, $fds[4].events); }
    printf("]\n");
}
tracepoint:syscalls:sys_exit_ppoll /pid == '$SSH_PID'/ {
    printf("%d ppoll_exit ret=%d\n", nsecs, args->ret);
}
tracepoint:syscalls:sys_enter_read /pid == '$SSH_PID'/ {
    printf("%d read(%d, %d)\n", nsecs, args->fd, args->count);
}
tracepoint:syscalls:sys_exit_read /pid == '$SSH_PID'/ {
    printf("%d read_ret=%d\n", nsecs, args->ret);
}
'
WORKAROUND
----------
Using socat in fork mode avoids the issue, presumably because each
forwarded connection gets its own ssh invocation (and thus a fresh
channel), e.g.:
socat TCP-LISTEN:6443,fork,reuseaddr \
"EXEC:ssh bastion 'socat STDIO TCP:kubernetes-api:443'"
Happy to provide additional traces, test patches, or clarify anything.
--
You are receiving this mail because:
You are watching the assignee of the bug.