sshd hangs

Martin Dudle martin.dudle at mgb.ch
Thu Jan 20 03:29:50 EST 2005


we are using openssh-3.8.1p1 from sunfreeware.com on solaris 8
(uname -a: SunOS XXX 5.8 Generic_117000-03 sun4u sparc SUNW,Sun-Fire-V240).

sshd seems to ignore or miss SIGCLD. this is rare: we observe it
about once per week in an ssh-intensive environment.

the process hangs here:

truss:
24453:  poll(0xFFBEEF28, 2, -1)         (sleeping...)        

gcore, mdb:
libc.so.1`_poll+4(b, 0, 0, ffbeef38, 6fc40, ffbeef28)
0x20710(ffbef084, ffbef080, ffbef07c, ffbef078, 0, 1)
server_loop2+0xd4(6a800, 0, 0, ff1e8000, 2151c, 1)
do_authenticated+0x80(753b0, 6a400, f90, 1, 2151c, 6d800)
main+0xbf4(2f, 6fc00, 6a800, 1ecc, 1, 6dbd0)
_start+0x5c(0, 0, 0, 0, 0, 0)

the corresponding c sources are:

void
server_loop2(Authctxt *authctxt)
{
[ ... ]
        for (;;) {
                process_buffered_input_packets();

                rekeying = (xxx_kex != NULL && !xxx_kex->done);

                if (!rekeying && packet_not_very_much_data_to_write())
                        channel_output_poll();
                wait_until_can_do_something(&readset, &writeset, &max_fd,
                    &nalloc, 0);
[ ...]

and it hangs in the select() call in wait_until_can_do_something().


question: why is the wait time set to 0 (= wait forever)? server_loop()
(the interactive function) does not set it to 0.
if the child exits without the parent noticing, then we hang forever,
which is bad.
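
for illustration, a minimal sketch of the timeout convention in question, assuming a wait helper that maps a limit of 0 to a NULL timeout for select(). wait_readable() is a hypothetical name, not an openssh function. with a non-zero limit the caller wakes up periodically and could re-check for dead children even if a SIGCLD was missed.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Hypothetical helper, not OpenSSH code: wait until fd becomes readable.
 * max_ms == 0 means "no timeout": select() blocks until an event or a
 * signal arrives, which is the convention discussed above.
 * Returns select()'s result: 1 if readable, 0 on timeout, -1 on error
 * (including EINTR).
 */
static int
wait_readable(int fd, unsigned int max_ms)
{
	fd_set readset;
	struct timeval tv, *tvp = NULL;

	FD_ZERO(&readset);
	FD_SET(fd, &readset);

	if (max_ms != 0) {
		/* bounded wait: convert milliseconds to a struct timeval */
		tv.tv_sec = max_ms / 1000;
		tv.tv_usec = (max_ms % 1000) * 1000;
		tvp = &tv;
	}
	return select(fd + 1, &readset, NULL, NULL, tvp);
}
```

wait_readable(fd, 0) blocks until fd is readable or a signal interrupts select(); wait_readable(fd, 1000) returns 0 after roughly one second without activity.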


i tried to send the process a SIGCLD by hand to see if it would 'unlock' 
itself. here's the result:

# kill -CLD 24453

truss:
24453:      Received signal #18, SIGCLD, in poll() [caught]
24453:  poll(0xFFBEEF28, 2, -1)                         Err#4 EINTR
24453:  sigaction(SIGCLD, 0x00000000, 0xFFBEEA90)       = 0
24453:  write(6, "\0", 1)                               = 1
24453:  setcontext(0xFFBEEC10)
24453:  sigprocmask(SIG_BLOCK, 0xFFBEEFE8, 0xFFBEEFF8)  = 0
24453:  waitid(P_ALL, 0, 0xFFBEEF00, WEXITED|WTRAPPED|WNOHANG) Err#10 ECHILD
24453:  sigprocmask(SIG_SETMASK, 0xFFBEEFF8, 0x00000000) = 0
24453:  poll(0xFFBEEF28, 2, -1)                         = 1
24453:  read(4, "\0", 1)                                = 1
24453:  read(4, 0xFFBEEF8F, 1)                          Err#11 EAGAIN
24453:  sigprocmask(SIG_BLOCK, 0xFFBEEFE8, 0xFFBEEFF8)  = 0
24453:  sigprocmask(SIG_SETMASK, 0xFFBEEFF8, 0x00000000) = 0
24453:  poll(0xFFBEEF28, 2, -1)         (sleeping...)
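
reading the trace: the write(6, "\0", 1) inside the signal handler, followed by the read(4, ...) once poll() returns, looks like the self-pipe technique, where the handler writes one byte to a pipe that the main loop also polls. a minimal sketch of that technique, as an interpretation of the trace only; the names are illustrative and the code is not taken from serverloop.c.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Illustrative names; this is a sketch, not the OpenSSH implementation. */
static int notify_pipe[2] = { -1, -1 };
static volatile sig_atomic_t child_terminated = 0;

/* SIGCHLD handler: set a flag and wake the main loop through the pipe. */
static void
sigchld_handler(int sig)
{
	int save_errno = errno;

	(void)sig;
	child_terminated = 1;
	if (notify_pipe[1] != -1) {
		/* a full pipe is fine: a wakeup byte is already pending */
		if (write(notify_pipe[1], "", 1) == -1) {
			/* nothing useful can be done inside a handler */
		}
	}
	errno = save_errno;
}

/* Create the pipe and make both ends non-blocking. Returns 0 or -1. */
static int
notify_setup(void)
{
	int i, flags;

	if (pipe(notify_pipe) < 0)
		return -1;
	for (i = 0; i < 2; i++) {
		flags = fcntl(notify_pipe[i], F_GETFL);
		if (flags < 0 ||
		    fcntl(notify_pipe[i], F_SETFL, flags | O_NONBLOCK) < 0)
			return -1;
	}
	return 0;
}

/* Main-loop side: empty the pipe; returns the number of bytes drained. */
static int
notify_drain(void)
{
	char c;
	int n = 0;

	/* the read end is non-blocking, so this stops with EAGAIN */
	while (read(notify_pipe[0], &c, 1) == 1)
		n++;
	return n;
}
```

in this scheme the main loop adds notify_pipe[0] to the set it polls, so a signal that arrives just before the wait still leaves a byte in the pipe and wakes the loop. the read() that ends in EAGAIN in the trace would correspond to the drain loop running dry.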

it seems there is another problem here with collect_children() not 
handling ECHILD:
static void
collect_children(void)
{
        pid_t pid;
        sigset_t oset, nset;   
        int status;

        /* block SIGCHLD while we check for dead children */
        sigemptyset(&nset);
        sigaddset(&nset, SIGCHLD);
        sigprocmask(SIG_BLOCK, &nset, &oset);
        if (child_terminated) {
                while ((pid = waitpid(-1, &status, WNOHANG)) > 0 ||
                    (pid < 0 && errno == EINTR))          
                        if (pid > 0)             
                                session_close_by_pid(pid, status);
                child_terminated = 0;
        }       
        sigprocmask(SIG_SETMASK, &oset, NULL);
}

waitpid() (shown as waitid() in the truss output) returns -1 with
errno == ECHILD. child_terminated is reset to 0 (why?) and that's it.
the program then returns to the endless for (;;) loop in server_loop2()
and sleeps forever again.
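
to make the ECHILD case concrete, here is a minimal sketch of a reaping loop that tells its caller when waitpid() ran out of children, so the caller could decide what to do with sessions that are still open. reap_children() and its no_children flag are hypothetical and this is not a proposed patch for collect_children().

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: reap every exited child without blocking.
 * Returns the number of children reaped. *no_children is set to 1 when
 * waitpid() finished with ECHILD (no children left at all), else to 0.
 */
static int
reap_children(int *no_children)
{
	pid_t pid;
	int status, reaped = 0;

	*no_children = 0;
	for (;;) {
		pid = waitpid(-1, &status, WNOHANG);
		if (pid > 0) {
			/* a real caller would close the matching session here */
			reaped++;
			continue;
		}
		if (pid < 0 && errno == EINTR)
			continue;	/* interrupted: retry */
		if (pid < 0 && errno == ECHILD)
			*no_children = 1;
		break;	/* pid == 0 means children exist but none has exited */
	}
	return reaped;
}
```

a caller that sees no_children set while it still has sessions marked as running knows that it will never receive an exit status for them, which is the situation the hand-sent SIGCLD exposed.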

could anyone shed some light on these thoughts? thanks.

regards,
-martin dudle




More information about the openssh-unix-dev mailing list