remote vs local window discrepancy

Fri Jul 23 07:09:19 EST 2010

I am using an OpenSSH tunnel between two Linux boxes.  On the client
box I issue the following commands to set up the tunnel:

- ssh -w0:0 root@x.x.x.x -v, where x.x.x.x is the IP address of the Linux
system running sshd
- ip addr add 10.0.5.1/32 peer 10.0.5.2 dev tun0
- ip link set tun0 up

On the box running sshd I issue the following commands:
- ip addr add 10.0.5.2/32 peer 10.0.5.1 dev tun0
- ip link set tun0 up

The SSH tunnel comes up just fine.  I have a test case that runs 5000-byte
pings between the two boxes at a 0.02 second interval (e.g. ping -s5000
10.0.5.2 -i .02).  After roughly 6000 pings the connection stalls.  It does
not matter which box I initiate the pings from.  The MTU size is 1500.

The stall occurs because the client's remote window count for the tun
channel goes to 0.  The server's local window count is much larger.  Because
the client's and the server's views of the server's window size differ, an
SSH_MSG_CHANNEL_WINDOW_ADJUST message is never sent once the client's
remote window count goes to 0.  The client never attempts to read from the
tun device file descriptor again.

After some investigation I determined that for every packet sent the client
decrements Channel.remote_window by a value that is 4 bytes larger than the
amount by which the server decrements Channel.local_window and increments
Channel.local_consumed.  Prior to the stall the server does send
SSH_MSG_CHANNEL_WINDOW_ADJUST messages.  When it does, the "bytes to add"
value is short by 4 times the number of packets consumed by the server.
Over time this drives the client's remote window count to zero.  As an
aside, the remote window count has to be exactly 0 for the stall to occur.
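
The accounting described above can be sketched as a toy model.  This is
not OpenSSH code: the ToyChannel struct, the toy_* function names, and the
choice to send a window adjust after every packet are assumptions made
purely for illustration.  It only shows how a fixed per-packet mismatch
between what the sender charges and what the receiver credits back
accumulates in the sender's view of the window.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only; field names echo the Channel fields named above. */
typedef struct {
    uint32_t remote_window;   /* sender's view of the receiver's window */
    uint32_t local_window;    /* receiver's own view of its window */
    uint32_t local_consumed;  /* consumed bytes not yet re-advertised */
} ToyChannel;

/* One packet: each side charges its own (possibly different) cost. */
static void toy_send_packet(ToyChannel *c, uint32_t sender_cost,
                            uint32_t receiver_cost)
{
    /* Keep the model out of the unsigned-wrap regime. */
    assert(sender_cost <= c->remote_window);
    assert(receiver_cost <= c->local_window);
    c->remote_window -= sender_cost;
    c->local_window -= receiver_cost;
    c->local_consumed += receiver_cost;
}

/* Receiver re-advertises what it consumed (the "bytes to add" value). */
static void toy_window_adjust(ToyChannel *c)
{
    c->remote_window += c->local_consumed;
    c->local_window += c->local_consumed;
    c->local_consumed = 0;
}

/*
 * Gap between the two views of the window after `packets` packets.
 * Assumption: the receiver adjusts after every packet, which a real
 * implementation would not necessarily do.
 */
static uint32_t toy_drift_after(uint32_t packets, uint32_t initial_window,
                                uint32_t sender_cost, uint32_t receiver_cost)
{
    ToyChannel c = { initial_window, initial_window, 0 };
    for (uint32_t i = 0; i < packets; i++) {
        toy_send_packet(&c, sender_cost, receiver_cost);
        toy_window_adjust(&c);
    }
    return c.local_window - c.remote_window;
}
```

With arbitrary illustrative numbers, toy_drift_after(100, 2097152, 1004,
1000) returns 400: the sender's view falls behind by 4 bytes per packet
even though the receiver's own window is fully replenished each time.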

Initially the following line of code in channel_output_poll that decrements
the remote window count for datagram channels looked suspicious:

c->remote_window -= dlen + 4;

However, the code that updates Channel.local_window and
Channel.local_consumed for a datagram channel also includes the +4 in the
calculation.  Does anybody know why the datagram calculation includes a +4?
Anybody know what would cause the 4 byte discrepancy I am seeing?

A complicating factor is that in channel_output_poll the calculation to
update the remote window in the datagram case does not take into account
that dlen may be larger than the remote_window size.  Does anybody know
why?  Perhaps there is a check elsewhere that makes this safe, but I am not
seeing it.  During problem determination I have observed the value of the
remote window does occasionally wrap.  When the remote window counter does
wrap it goes undetected because Channel.remote_window is an unsigned value.
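
To make the wrap concrete, here is a minimal sketch, assuming a 32-bit
unsigned counter.  The function names are hypothetical and this is not a
proposed patch: unchecked_decrement only mirrors the shape of the
subtraction quoted earlier, and guarded_decrement shows one way a caller
could refuse a datagram that does not fit in the remaining window.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same shape as "c->remote_window -= dlen + 4;" with no bounds check. */
static uint32_t unchecked_decrement(uint32_t window, uint32_t dlen)
{
    return window - (dlen + 4);   /* wraps modulo 2^32 if dlen + 4 > window */
}

/*
 * Hypothetical guarded variant: returns false and leaves *window untouched
 * when the datagram plus its 4 extra bytes does not fit.
 */
static bool guarded_decrement(uint32_t *window, uint32_t dlen)
{
    if (dlen > UINT32_MAX - 4)    /* dlen + 4 itself would overflow */
        return false;
    uint32_t cost = dlen + 4;
    if (cost > *window)
        return false;
    *window -= cost;
    return true;
}
```

For example, unchecked_decrement(100, 5000) yields 4294962392 rather than
a negative number, while guarded_decrement on a window of 100 returns false
and leaves the window at 100.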

Another item I find confusing is the test in channel_pre_open to decide if
the channel's read file descriptor should be turned on in the read fileset.
That test includes a check of a variable called limit which is set to
Channel.remote_window when compat20 is true.  Can somebody explain why this
is remote_window instead of local_window?  The check is "limit > 0", which
is why the wrapping of remote_window goes undetected.
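
A small sketch of why a "greater than zero" test on an unsigned limit
behaves this way.  toy_wants_read is a hypothetical stand-in for the test
described above, not the actual channel_pre_open code, and the 32-bit width
is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the "limit > 0" test on an unsigned limit. */
static bool toy_wants_read(uint32_t limit)
{
    return limit > 0;
}

/* A window that has gone "below zero" in unsigned arithmetic. */
static uint32_t toy_wrapped_window(void)
{
    uint32_t window = 100;
    window -= 5004;               /* wraps to a very large positive value */
    return window;
}
```

Here toy_wants_read(toy_wrapped_window()) is true, so reading continues
after a wrap, whereas toy_wants_read(0) is false, which matches the
observation that only a count of exactly 0 produces the stall.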

Any insight into these questions will be appreciated.  Thanks.

Dave Wierbowski