Hanging ssh session...

James Oden joden at eworld.wox.org
Tue Oct 9 04:52:20 EST 2001


Hi All,

I am not sure whether this is the same as the hang-on-exit bug, so apologies
if this duplicates earlier reports.

Essentially, I am seeing ssh hang on about 0.5% to 1% of my connections.
I am running 2.9p2 on Solaris 7.  I have empirical data on the hangs: I
wrote a script that opens these connections in an endless loop and sets an
alarm so it can recover from a hang.  The script is at the end of this
email.

I am using RSA authentication with no passwords, over an Ethernet
network.

Here is the output of truss and pstack on the hung local and remote
ssh processes:

Local:
        truss:
                poll(0xFFBEEFC0, 2, -1)         (sleeping...)
        pstack:
                10879:  ssh epapdev@mate ls /etc/hosts
                 ff217cfc poll     (ffbeefc0, 2, ffffffff)
                 ff1cf6b0 select   (ffbeefd0, ff238bc4, 14b480, ff238bc8, 14b484, a) + 298
                 0004cc44 client_wait_until_can_do_something (ffbef200, ffbef1fc, ffbef1e4, 0, 9, 10000) + 3c4
                 0004e8a4 client_loop (0, ffffffff, 0, 14afb8, ff235ad4, 85308) + 6d4
                 00040c94 ssh_session2 (14afb8, 2, ffbef684, 141684, 144da8, 144da8) + 11c
                 0003f41c main     (4, ffbef50c, ffbef520, 131c00, 0, 0) + 1cd4
                 0003cfbc _start   (0, 0, 0, 0, 0, 0) + dc
Remote:
        truss:
                poll(0xFFBEF558, 2, -1)         (sleeping...)
        pstack:
                15390:  /opt/TKLCplat/sbin/sshd
                 ff217cfc poll     (ffbef558, 2, ffffffff)
                 ff1cf6b0 select   (ffbef568, ff238bc4, 153230, ff238bc8, 153234, c) + 298
                 00052128 wait_until_can_do_something (ffbef6dc, ffbef6d8, ffbef6d0, 0, 0, 0) + 500
                 0005387c server_loop2 (0, 0, 0, 0, 0, 0) + 19c
                 0005ab60 do_authenticated2 (153ea0, 0, 0, 0, ff235ad4, 54bd0) + 8
                 00054c40 do_authenticated (153ea0, 153ea0, 153ea0, 2000, ffff, 0) + b0
                 0004435c do_authentication2 (1187a0, 7, c30b, ffbefd64, ff235ad4, 41888) + d4
                 00041914 main     (1, ffbefdec, ffbefdf4, 138c00, 0, 0) + 267c
                 0003dedc _start   (0, 0, 0, 0, 0, 0) + dc

truss shows only one call because I attached it to the process after it
had already hung.  The one thing I can see, with my limited experience, is
that both the remote and the local process are blocked in poll() with no
timeout (-1).  Since each is waiting indefinitely, presumably for data from
the other, I suppose they are deadlocked.
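To illustrate the suspected situation: poll() with a timeout of -1, as in both truss lines, returns only when a descriptor becomes ready, so if each side is waiting for the other to send first, neither ever wakes up.  The Perl sketch below is a hypothetical illustration, not ssh code: a socketpair stands in for the ssh channel, and the helper wait_for_input() is a name made up for this example.  It polls with a bounded timeout so it returns 0 instead of blocking; passing undef as the timeout would make IO::Poll wait forever, like the hung processes.

```perl
#!/usr/bin/perl
# Hypothetical illustration only (not ssh code): two endpoints that
# each wait for the other to send first.
use strict;
use warnings;
use IO::Socket ();
use IO::Poll qw(POLLIN);
use Socket qw(AF_UNIX SOCK_STREAM PF_UNSPEC);

# A connected socket pair stands in for the client <-> server channel.
my ($client, $server) = IO::Socket->socketpair(AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";

# Wait up to $timeout seconds for $fh to become readable and return the
# number of ready handles (0 on timeout). An undef $timeout makes IO::Poll
# pass -1 to poll(2), i.e. wait forever, as in the truss output above.
sub wait_for_input {
    my ($fh, $timeout) = @_;
    my $poll = IO::Poll->new;
    $poll->mask($fh => POLLIN);
    return $poll->poll($timeout);
}

# Neither side has written anything, so both bounded waits time out.
my $client_ready = wait_for_input($client, 0.5);
my $server_ready = wait_for_input($server, 0.5);
print "client ready: $client_ready, server ready: $server_ready\n";
```

This prints "client ready: 0, server ready: 0".  As soon as one side writes, the other side's poll returns 1; a write that never arrives is consistent with, though not proof of, the deadlock suspected above.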

I have been loosely following the hang-on-exit thread and gathered that
this might have something to do with ttys, so I tried the -T switch
(which disables pseudo-tty allocation).  Over a six-hour run, this yielded
the same ratio of hangs to successes as running without the switch.

Is there any workaround available for this?  Also, do you need any more
information from me?  If needed, I could change my program to run truss
on every attempted session and save the output for the sessions that hang.

Cheers...james

P.S. You must set the $host and $login variables in the script below to a
machine of your choosing that accepts RSA authentication.

<<<Test Program follows>>>
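#!/usr/bin/perl
use strict;
use warnings;

my $count   = 0;
my $test    = "~~~ ring ~~~ ring ~~~~\n";	# die() message used by the alarm handler
my $sshhung = 0;
my $success = 0;
my $evalerr = 0;
my $pid;
my $childpid;
my $rc;
my $login = "";		# Place your login here
my $host  = "";		# Place your host here

while(1)
{
        $count++;
        print <<EOF;
Test #${count}
=========================
      Hangs:  ${sshhung}
Eval Errors:  ${evalerr}
    Success:  ${success}
EOF

        eval {
                local $SIG{ALRM} = sub { die $test };
                $pid = fork();
                die "Could not fork: $!\n" unless defined $pid;
                if($pid == 0)   # I am the child
                {
                        # Escape the @ so Perl does not try to interpolate an array.
                        exec('ssh', "${login}\@${host}", 'ls', '/etc/hosts');
                        # Exit instead of die(), so a failed exec does not
                        # drop the child back into the parent's loop.
                        warn "EXEC FAILED: $!\n";
                        exit(127);
                }       # End of child

                #
                # Ok, back in the parental role...
                $childpid = '';
                alarm(10);
                # Wait for this particular child; a plain wait() could reap
                # a zombie left over from an earlier hung run instead.
                $childpid = waitpid($pid, 0);
                $rc = $? >> 8;
                alarm(0);
        };
        alarm(0);       # Make sure no alarm is left pending after an error.

        # If we timed out, then ssh hung (keys may not have been exchanged).
        # Any other error is counted as an eval error.
        if($@)
        {
                # Any sort of error in here means that the child
                # may still be alive, so kill it and reap it so it
                # does not linger as a zombie.
                if(defined $pid && $pid > 0)
                {
                        kill(9, $pid);
                        waitpid($pid, 0);
                }

                #
                # Was this a timeout or some other error?
                if ($@ ne $test) { $evalerr++; }
                else             { $sshhung++; }
        }
        else { $success++; }
}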





More information about the openssh-unix-dev mailing list