SIGCHLD race condition?

Paul Menage pmenage at ensim.com
Tue Sep 18 21:50:59 EST 2001


We use ssh (RedHat 2.5.2p2-5) heavily in non-interactive mode, for
managing servers from central controllers, and transferring applications/
data around networks.

Very occasionally we've seen the situation where the ssh client and
server are both stuck in select, both selecting on only the tcp socket
of the connection, and with no timeout. No children of sshd remain (even
as zombies), and it has no other interesting open fds.

If you send a SIGCHLD to the hung sshd, it wakes up and exits.

As far as I can see, there's a race condition in
wait_until_can_do_something(), both in RedHat 2.5.2p2-5 and in the
latest CVS sources. It tests child_terminated, and sets a non-zero
timeout if so, before calling select(). However, there is a very small
window (between checking child_terminated and calling select()) in which
a SIGCHLD can arrive and set child_terminated. If this happens, and
there is no other activity from the client or the child fds, sshd can
hang indefinitely in the select().
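The check-then-select pattern described above can be sketched in
isolation. This is a minimal illustration, not the sshd source: only the
name child_terminated comes from the code under discussion, and
pick_timeout(), deliver_sigchld_now(), stale_timeout_after_window() and
the 100 ms value are hypothetical names and numbers chosen for the
example. The hook stands in for a SIGCHLD that arrives inside the
window, and the select() call itself is left as a comment so the sketch
never blocks.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/time.h>

/* Set by the SIGCHLD handler, as described for sshd. */
volatile sig_atomic_t child_terminated = 0;

void on_sigchld(int sig)
{
    (void)sig;
    child_terminated = 1;
}

void install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, NULL);
}

/* Step 1 of the racy pattern: choose select()'s timeout from the flag.
 * A NULL result means "block with no timeout". */
struct timeval *pick_timeout(struct timeval *tv)
{
    assert(tv != NULL);
    if (child_terminated) {
        tv->tv_sec = 0;
        tv->tv_usec = 100 * 1000;   /* illustrative value only */
        return tv;
    }
    return NULL;
}

/* Hypothetical stand-in for a SIGCHLD arriving in the window.
 * In a single-threaded program raise() returns only after the
 * handler has run. */
void deliver_sigchld_now(void)
{
    raise(SIGCHLD);
}

/* Returns 1 when select() would be entered with no timeout even though
 * child_terminated is already set, i.e. the hang condition. */
int stale_timeout_after_window(void (*window_hook)(void))
{
    struct timeval tv;
    struct timeval *tvp = pick_timeout(&tv);   /* the check */

    if (window_hook != NULL)
        window_hook();                         /* the race window */

    /* select(maxfd + 1, &readset, &writeset, NULL, tvp) would go here */
    return tvp == NULL && child_terminated != 0;
}
```

After install_sigchld_handler(), calling
stale_timeout_after_window(deliver_sigchld_now) with the flag clear
reports the stale decision; passing NULL as the hook does not, because
nothing changes the flag between the check and the would-be select().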

Catching this bug in the wild is not easy, as the window for the race
condition is so small. However, it can be fairly easily reproduced under
the following slightly artificial conditions, by using gdb to pause sshd
within the window for long enough to kill the child:

Run:

    ssh -T -x localhost 'sleep 30s; echo X; exec >&- 2>&- <&- sleep 5h'

Find the sshd serving this connection, and connect to it with gdb. It 
will be in the middle of the actual select() system call.

Set a breakpoint at the start of libc select() and continue. When the
first sleep completes, the shell will print X, and sshd will hit the
breakpoint in select().

Continue once, and the X will get printed out at the client end of the
connection, and sshd will hit the breakpoint again. 

By this time, the child shell has closed its fds and exec'd itself as
"sleep 5h". Since the breakpoint is at the start of libc select(), sshd
has checked child_terminated, but not yet invoked the select() system
call.

Send a SIGKILL to the child "sleep" process. A SIGCHLD becomes pending 
for sshd.

Quit gdb, detaching from the process. At this point, sshd receives the
SIGCHLD, sets child_terminated, and enters the select() system call. In
the absence of any external events, the client/server are now
deadlocked, even though the child has exited.

In this reproduction, the child remains clearly visible as a zombie. We
have also seen hangs in the wild where there is no zombie child, but
I've not been able to reproduce that case.

I think that the correct way to fix this would probably be to use
something like SIGIO and sigtimedwait() rather than select(), but that
would be a substantial change. A simple fix for this problem would be to
set a maximum timeout on the select() call of e.g. 15s. Are there any
complications or bugs that could be introduced by such a change?
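A rough sketch of the bounded-timeout idea, under stated assumptions: it
is not a patch against wait_until_can_do_something(), and
wait_for_fd_or_child(), window_hook and demo_signal_in_window() are
hypothetical names for illustration. The hook again simulates a SIGCHLD
landing between the flag check and select(). Because select() never
blocks longer than the cap, the loop comes back around and notices the
flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

volatile sig_atomic_t child_terminated = 0;

void on_sigchld(int sig)
{
    (void)sig;
    child_terminated = 1;
}

void install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, NULL);
}

/* Hypothetical stand-in for a SIGCHLD arriving in the race window. */
void deliver_sigchld_now(void)
{
    raise(SIGCHLD);
}

/* Returns 1 once the child is seen to have terminated, 0 when fd is
 * readable, -1 on a select() error other than EINTR. */
int wait_for_fd_or_child(int fd, long max_timeout_ms,
                         void (*window_hook)(void))
{
    assert(fd >= 0 && fd < FD_SETSIZE);

    for (;;) {
        fd_set readset;
        struct timeval tv;
        int n;

        if (child_terminated)
            return 1;
        if (window_hook != NULL)
            window_hook();              /* simulated race window */

        FD_ZERO(&readset);
        FD_SET(fd, &readset);
        tv.tv_sec = max_timeout_ms / 1000;
        tv.tv_usec = (max_timeout_ms % 1000) * 1000;

        n = select(fd + 1, &readset, NULL, NULL, &tv);
        if (n > 0)
            return 0;
        if (n < 0 && errno != EINTR)
            return -1;
        /* Timeout or EINTR: loop and re-check child_terminated. */
    }
}

/* Usage sketch: an idle pipe stands in for the quiet TCP socket. */
int demo_signal_in_window(void)
{
    int p[2];
    int r;

    if (pipe(p) != 0)
        return -1;
    install_sigchld_handler();
    child_terminated = 0;
    r = wait_for_fd_or_child(p[0], 50, deliver_sigchld_now);
    close(p[0]);
    close(p[1]);
    return r;
}
```

The demo uses a 50 ms cap so that it finishes quickly; the 15s suggested
above would go in the same place. The trade-off is that a signal lost in
the window is only noticed after up to one full timeout, so the cap
bounds the hang rather than closing the race.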

Paul




More information about the openssh-unix-dev mailing list