Hung connection over Juniper Tunnel

Darren Tucker dtucker at zip.com.au
Sat Feb 7 16:06:22 EST 2009


Damien Miller wrote:
> On Fri, 6 Feb 2009, Jason Benguerel wrote:
> 
>> Hello list!
>>
>> So I recently reconfigured our office network to allow a permanent VPN  
>> connection to our data center. This consists of a Juniper SSG-520  
>> connected via a tunnel to a Juniper Netscreen-25 over a 100M leased  
>> NTT VPN (yes I'm tunneling over the VPN as it's the only way to make  
>> it routable.) Here is where OpenSSH comes in. When I try and ssh to a  
>> machine on the other end of the tunnel, I can get past the  
>> authentication stage and then it just hangs and times out. Everything  
>> else works, ping, http, and dns (ICMP, TCP and UDP in other words.)  
>> More cryptically, I can effortlessly ssh with PuTTY from a windows  
>> box. It seems that OpenSSH (or the Unix TCP/IP stack) is the only  
>> thing affected. Now I'm the first to admit that this is most likely  
>> some sort of subtle MTU or low level TCP issue, and I'm guessing the  
>> OpenSSH is the canary in the coal mine, it would be great if I can get  
>> someone to tell me why it's freezing so that I can fix the actual cause.
>>
>> There were several people complaining of similar issues, typically it  
>> turned out to be bad wireless drivers or broken routers, no direct  
>> cause was ever indicated.
> 
> There are two types of common hang:
> 
> 1) Long-lived but idle SSH connections being timed out of NAT/firewall state
>    after some period of quiescence. This can be worked around with the
>    ServerAliveInterval and ClientAliveInterval controls in ssh_config and
>    sshd_config respectively.
> 
> 2) Path MTU blackholes. The hang here usually occurs when either end first
>    sends a packet containing an MTU of data or more. There is no SSH-level
>    workaround for this, but the tool of choice to diagnose it is 
>    "ping -D -s xxxx yourhost" where xxxx is the packet size that you want
>    to test (start at 1492 and work down).

3) Some NAT/firewalls seem to choke when Nagle gets disabled on an 
established connection (that's what's happening when the debug output 
says "setting TCP_NODELAY" immediately before your connection freezes, 
which is what makes me think that's the problem here).

You can test this theory by editing misc.c:set_nodelay() and adding a 
"return;" immediately after the variable declarations (this is around 
line 138 in recent versions) and recompiling ssh.
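
For anyone unfamiliar with what that function does, here is a rough sketch
of the idea. This is not the actual OpenSSH source: set_nodelay_sketch() and
the skip_nodelay switch are hypothetical names made up for illustration, and
the real set_nodelay() may differ in its details. It only shows that
"disabling Nagle" means setting TCP_NODELAY with setsockopt(), and that the
early return suggested above simply leaves the socket option untouched.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical switch for the experiment: 1 = leave Nagle enabled. */
static int skip_nodelay = 0;

/*
 * Illustrative stand-in, NOT the OpenSSH code: disable Nagle on fd
 * unless skip_nodelay is set.
 * Returns 1 if this call set TCP_NODELAY, 0 if nothing was changed,
 * -1 on error.
 */
static int
set_nodelay_sketch(int fd)
{
	int opt = 0;
	socklen_t optlen = sizeof(opt);

	/* Equivalent of the early "return;" after the declarations. */
	if (skip_nodelay)
		return 0;

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &opt, &optlen) == -1) {
		perror("getsockopt TCP_NODELAY");
		return -1;
	}
	if (opt != 0)
		return 0;	/* Nagle is already disabled */

	opt = 1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt)) == -1) {
		perror("setsockopt TCP_NODELAY");
		return -1;
	}
	return 1;
}
```

With skip_nodelay set to 1 the socket keeps its default behaviour (Nagle
enabled), which is the condition the test build is meant to produce. If the
hang goes away in that build, the middlebox is the likely culprit.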

-- 
Darren Tucker (dtucker at zip.com.au)
GPG key 8FF4FA69 / D9A3 86E9 7EEE AF4B B2D4  37C9 C982 80C7 8FF4 FA69
     Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.


More information about the openssh-unix-dev mailing list