Server/Client Alive mechanism issues

Darryl Miles darryl-mailinglists at netbauds.net
Fri Jan 10 04:18:51 EST 2014



Old thread, I know, but I have the opposite problem.  Maybe SSH was 
changed in connection with this report?  See my recent (Jan 2014) ML 
thread.

I am observing SSH waiting for a TCP-level timeout to occur when the 
other end has gone away (and is not sending back any data or a TCP RST).



Jeff Mitchell wrote:
> I have a bandwidth-constrained connection that I'd like to run rsync
> over through an SSH tunnel. I also want to detect any network drops
> pretty rapidly.

If you are bandwidth constrained, why are you wasting bandwidth on 
1-second ping-pongs?  What % of your overall data are you wasting on 
that effort?

Does your use of the application require connection recovery (for a 
stalled, non-working connection) within tens of seconds?  So you are in 
a bandwidth-constrained environment trying to send bulk data, and you 
must know if the other end has become unavailable within 6 seconds of 
it doing so?

If you're bandwidth constrained I would have thought both ends should 
be patient when waiting for data; turning up ServerAliveInterval (to, 
say, 10 seconds) and turning down ServerAliveCountMax (to, say, 2) is a 
better way to go, increasing the interval further as necessary.
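
On the client side that might look like the following in ~/.ssh/config 
(the host name and exact values here are illustrative, not a 
recommendation; ClientAliveInterval/ClientAliveCountMax are the 
server-side equivalents in sshd_config):

    # Hypothetical host entry: probe every 10 seconds, give up after
    # 2 missed replies, i.e. roughly 20 seconds to declare the peer dead.
    Host slowlink
        ServerAliveInterval 10
        ServerAliveCountMax 2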


> After about 5 seconds, the connection is being dropped, but during that
> time the rsync is successfully transferring data near the full bandwidth
> of the connection.

Maybe you can ask the SSH client/server (on both sides, or at least on 
the side pushing the most data) to turn down SO_SNDBUF to minimize the 
kernel buffer.  This can be done on a socket-by-socket basis using the 
setsockopt() kernel API, so it is something ssh/sshd would need to 
implement on your behalf.
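
A minimal sketch of what that could look like inside ssh/sshd (the 
helper name and the 16 KiB value are mine and purely illustrative; 'fd' 
is assumed to be the already-connected TCP socket):

    /* Hypothetical helper: shrink the kernel send buffer on a
     * connected TCP socket so that less data can sit queued in the
     * kernel beneath the SSH process. */
    #include <sys/socket.h>
    #include <stdio.h>

    static int
    shrink_sndbuf(int fd)
    {
            int bytes = 16 * 1024;  /* requested SO_SNDBUF; illustrative */

            if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                &bytes, sizeof(bytes)) == -1) {
                    perror("setsockopt(SO_SNDBUF)");
                    return -1;
            }
            return 0;
    }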

While the connection is sending, if you run "netstat -tanp" (on Linux) 
the number of bytes queued in the kernel buffer is shown in the Send-Q 
column.  Reducing SO_SNDBUF decreases this value, at the cost of making 
the sending process wake up more often to refill the kernel buffer.  It 
sounds like your CPU processing power far exceeds the network 
throughput, so I do not think this will be a concern in your scenario. 
The lowest value for SO_SNDBUF according to the Linux man page is 2048 
bytes.
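
If you want to verify what the kernel actually applied, you can read 
the value back (a sketch; note that on Linux getsockopt() returns 
double the value you requested, because the kernel doubles SO_SNDBUF to 
allow for bookkeeping overhead, and that doubled value will not go 
below the 2048-byte floor):

    /* Sketch: report the effective send buffer size of socket 'fd'. */
    #include <sys/socket.h>
    #include <stdio.h>

    static void
    report_sndbuf(int fd)
    {
            int bytes = 0;
            socklen_t len = sizeof(bytes);

            if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, &len) == 0)
                    printf("effective SO_SNDBUF: %d bytes\n", bytes);
    }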

Note that if you make this value too low and your CPU does not refill 
the kernel buffer in time, it underruns (i.e. the TCP stack could have 
sent data but none was available because the application did not wake 
up and write() data quickly enough).  That will hurt performance, as 
TCP slow-start/congestion control may reset, causing overall measured 
throughput to drop.


man 7 socket (search SO_SNDBUF)
man 7 tcp

On Linux, see also /proc/sys/net/core/wmem_default for the system-wide 
default, if the SSH application does not have such an option.

You can 'cat /proc/sys/net/core/wmem_default' to see the current value; 
going below 32k on a 100 Mbit (or faster) Ethernet system is probably a 
bad idea.

Note that you only want to tweak this for the bandwidth-restricted 
application; setting it too low system-wide will have a major effect on 
the normal performance of an ordinary Ethernet-based system.


>
> My understanding is that since the alive mechanism is running inside the
> encrypted connection, OpenSSH would be able to (and would) prioritize
> the alive packets over other data. So if any data is able to get through
> (and it does) the alive packets should be able to as well. But this
> doesn't seem to be the case.

No.  While SSH is able to multiplex different channels inside a single 
TCP connection, the aggregated stream is still subject to kernel Send-Q 
buffering and then to network latency, congestion and performance 
limits.


So what you are doing is taking networking parameters tuned for system 
memory and Ethernet performance (in the case of Linux, again) and 
trying to use them over bandwidth-restricted connectivity.

The wmem/Send-Q default the OS picks is based on system memory and 
other such inter-related parameters, allowing it to auto-tune.


Darryl



