AW: sshd hangs

Martin.Dudle at mgb.ch Martin.Dudle at mgb.ch
Tue Jan 25 02:22:06 EST 2005


hello

applied the patch described below - unfortunately we still experience
rare hangs of the remote sshd. this is not surprising, as the patch only
changes a few lines in server_loop() but not in server_loop2(), which is
used for non-interactive sessions.

process id of hanging sshd: 26110

process is sleeping forever in poll (why does server_loop2() sleep
forever?):
root at XXX:~# truss -fp 26110
26110:  poll(0xFFBEF268, 2, -1)         (sleeping...)

no child processes are around:
root at XXX:~# ps -ef | grep 26110
    root 26110 24012  0 14:50:11 ?        0:00 /usr/local/sbin/sshd
    root  8136  7433  0 15:15:34 pts/5    0:00 truss -fp 26110
    root  8217  7433  0 15:15:55 pts/5    0:00 grep 26110

sending it a SIGCLD to see whether ECHILD would be handled properly (it
is not :-/):
root at XXX:~# kill -CLD 26110
26110:      Received signal #18, SIGCLD, in poll() [caught]
26110:  poll(0xFFBEF268, 2, -1)                         Err#4 EINTR
26110:  sigaction(SIGCLD, 0x00000000, 0xFFBEEDD0)       = 0
26110:  write(6, "\0", 1)                               = 1
26110:  setcontext(0xFFBEEF50)
26110:  sigprocmask(SIG_BLOCK, 0xFFBEF328, 0xFFBEF338)  = 0
26110:  waitid(P_ALL, 0, 0xFFBEF240, WEXITED|WTRAPPED|WNOHANG) Err#10 ECHILD
26110:  sigprocmask(SIG_SETMASK, 0xFFBEF338, 0x00000000) = 0
26110:  poll(0xFFBEF268, 2, -1)                         = 1
26110:  read(4, "\0", 1)                                = 1
26110:  read(4, 0xFFBEF2CF, 1)                          Err#11 EAGAIN
26110:  sigprocmask(SIG_BLOCK, 0xFFBEF328, 0xFFBEF338)  = 0
26110:  sigprocmask(SIG_SETMASK, 0xFFBEF338, 0x00000000) = 0
26110:  poll(0xFFBEF268, 2, -1)         (sleeping...)
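
for reference, the write(6, ...) inside the signal handler followed by
read(4, ...) once poll() returns looks like the usual self-pipe wakeup.
the following is only a minimal sketch of that pattern, not the actual
sshd code; notify_pipe, notify_setup() and notify_drain() are names made
up for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* hypothetical names, chosen for this sketch only */
static int notify_pipe[2] = { -1, -1 };
static volatile sig_atomic_t child_terminated = 0;

/* SIGCHLD handler: set a flag and wake the main loop via the pipe */
static void
sigchld_handler(int sig)
{
        int save_errno = errno;

        (void)sig;
        child_terminated = 1;
        /* one byte is enough; a full pipe just means a wakeup is pending */
        if (write(notify_pipe[1], "", 1) < 0) {
                /* nothing useful can be done inside a handler */
        }
        errno = save_errno;
}

/* create the pipe and make both ends non-blocking */
static int
notify_setup(void)
{
        int i, flags;

        if (pipe(notify_pipe) < 0)
                return -1;
        for (i = 0; i < 2; i++) {
                flags = fcntl(notify_pipe[i], F_GETFL);
                if (flags < 0 ||
                    fcntl(notify_pipe[i], F_SETFL, flags | O_NONBLOCK) < 0)
                        return -1;
        }
        return 0;
}

/* drain pending wakeup bytes; returns how many were read */
static int
notify_drain(void)
{
        char c;
        int n = 0;

        while (read(notify_pipe[0], &c, 1) == 1)
                n++;
        return n;
}
```

in such a design the main loop would include notify_pipe[0] in its
poll()/select() set and call notify_drain() when it becomes readable,
which matches the read() that ends in EAGAIN in the trace above.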

stack:
> $c
libc.so.1`_poll+4(b, 0, 0, ffbef278, 68dc8, ffbef268)
0x1f278(ffbef3c4, ffbef3c0, ffbef3bc, ffbef3b8, 0, 1)
server_loop2+0xe0(6e518, 0, 0, ff078000, 2151c, 1)
do_authenticated+0x80(6e518, 6e518, 6e518, ffbef4c4, 2151c, 66000)
main+0xc28(2e, 68d88, 64000, 1, 1ed0, 66674)
_start+0x5c(0, 0, 0, 0, 0, 0)

disassemble trace:
server_loop2+0xe0:              call      -0x102c       <0x1f118>

0x1f0f0:                        sethi     %hi(0x46c00), %o0
...
0x1f24c:                        add       %fp, -0x18, %o4
0x1f250:                        sll       %o0, 5, %g1
0x1f254:                        sub       %g1, %o0, %g1
0x1f258:                        sll       %g1, 2, %g1
0x1f25c:                        add       %g1, %o0, %g1
0x1f260:                        sll       %g1, 3, %g1
0x1f264:                        st        %g1, [%fp - 0x14]
0x1f268:                        ld        [%i2], %o0
0x1f26c:                        ld        [%i0], %o1
0x1f270:                        ld        [%i1], %o2
0x1f274:                        add       %o0, 1, %o0
0x1f278:                        call      +0x439b8      <PLT=libc.so.1`select>
0x1f27c:                        clr       %o3

c code (patched):
static void
collect_children(void)
{
        pid_t pid;
        sigset_t oset, nset;
        int status;

        /* block SIGCHLD while we check for dead children */
        sigemptyset(&nset);
        sigaddset(&nset, SIGCHLD);
        sigprocmask(SIG_BLOCK, &nset, &oset);
        if (child_terminated) {
                while ((pid = waitpid(-1, &status, WNOHANG)) > 0 ||
                    (pid < 0 && errno == EINTR))
                        if (pid > 0)
                                session_close_by_pid(pid, status);
                child_terminated = 0;
        }
        sigprocmask(SIG_SETMASK, &oset, NULL);
}


while code could be added to avoid the hang (give select() in
server_loop2() a finite timeout, have collect_children() detect and
handle ECHILD properly), i think that the child process should not die or
terminate undetected by the parent in the first place.
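
a rough sketch of what those two workarounds could look like, assuming
nothing about the real server_loop2() internals. reap_children() and
wait_readable() are hypothetical helpers written for this mail, and the
place where sshd would call session_close_by_pid() is only marked by a
comment.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * reap every exited child without blocking. returns the number reaped
 * and sets *no_children to 1 if waitpid() reports ECHILD, so the caller
 * can tell "nothing has exited yet" apart from "no child left at all".
 */
static int
reap_children(int *no_children)
{
        pid_t pid;
        int status, reaped = 0;

        *no_children = 0;
        for (;;) {
                pid = waitpid(-1, &status, WNOHANG);
                if (pid > 0) {
                        /* sshd would call session_close_by_pid() here */
                        reaped++;
                        continue;
                }
                if (pid < 0 && errno == EINTR)
                        continue;
                if (pid < 0 && errno == ECHILD)
                        *no_children = 1;
                break;          /* pid == 0 or a real error */
        }
        return reaped;
}

/*
 * wait until fd is readable, but never longer than timeout_sec seconds.
 * returns 1 if readable, 0 on timeout, -1 on error. the timeout is
 * restarted if a signal interrupts select().
 */
static int
wait_readable(int fd, int timeout_sec)
{
        fd_set rfds;
        struct timeval tv;
        int r;

        do {
                FD_ZERO(&rfds);
                FD_SET(fd, &rfds);
                tv.tv_sec = timeout_sec;
                tv.tv_usec = 0;
                r = select(fd + 1, &rfds, NULL, NULL, &tv);
        } while (r < 0 && errno == EINTR);
        return r;
}
```

with a finite timeout the loop would wake up periodically, call
something like reap_children(), and could close the session once
no_children is set instead of sleeping in poll() forever. this only
hides the symptom, of course.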

will try to find why this happens and let you know if i find something.

regards,
-martin



Martin Dudle wrote:
> using openssh-3.8.1p1 from sunfreeware.com on a SunOS XXX 5.8
> Generic_117000-03 sun4u sparc SUNW,Sun-Fire-V240.
> 
> sshd seems to ignore or miss SIGCLD. this is a rare behaviour we observe
> about once per week in a ssh intensive environment.

Try the patch attached to this bug:
http://bugzilla.mindrot.org/show_bug.cgi?id=967

-- 
Darren Tucker (dtucker at zip.com.au)
GPG key 8FF4FA69 / D9A3 86E9 7EEE AF4B B2D4  37C9 C982 80C7 8FF4 FA69
     Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.



