select() hangs in sftp_server_main()

Nathan Jahnke njahnke at gmail.com
Mon Apr 6 08:32:52 EST 2009


First off, a disclaimer: this is not a problem with OpenSSH per se, as
it also occurs with other software on my server, but I was hoping
someone reading this might know more about the problem than I do.
Thank you very much in advance for your help.

Problem: connecting to the server via sftp results in a hang here:

if (select(max+1, rset, wset, NULL, NULL) < 0) {

which is line 1428 of sftp-server.c in OpenSSH 5.2p1 (the main loop of
sftp_server_main()).

The same hang occurs when opening a data connection over e.g. vanilla
FTP. I am sometimes able to get through after a number of seconds or
minutes, but sometimes the connection times out on the client side
before the server is able to respond. When the server does respond and
I am connected, issuing a command such as 'ls' hangs again at the
select() for some time.
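
For context on the call quoted above: the final NULL argument is the
timeout, and select() with a NULL timeout blocks until at least one
descriptor in the sets is ready, so a "hang" there means nothing in
rset or wset became readable or writable. Below is a minimal sketch of
a bounded wait that a small test program could use to tell "nothing
ready yet" from a real error. wait_readable() is a hypothetical helper
of my own naming and is not part of sftp-server.c.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>   /* STDIN_FILENO, used in the usage note */

/* Hypothetical helper, not OpenSSH code: wait until fd is readable or
 * timeout_sec seconds have passed.
 * Returns 1 if readable, 0 on timeout, -1 on error or unusable fd. */
int wait_readable(int fd, long timeout_sec)
{
    fd_set rset;
    struct timeval tv;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;              /* select() cannot watch this descriptor */

    FD_ZERO(&rset);
    FD_SET(fd, &rset);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    /* Passing &tv instead of NULL makes select() return 0 on timeout
     * rather than blocking indefinitely. */
    int n = select(fd + 1, &rset, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    return (n > 0 && FD_ISSET(fd, &rset)) ? 1 : 0;
}
```

For example, wait_readable(STDIN_FILENO, 5) returns 0 after five
seconds if nothing arrives on standard input. The sketch does not retry
when select() is interrupted by a signal (EINTR).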

Interactive ssh is OK; I can connect with no delay, issue commands, etc.

I don't think it's socket exhaustion:

root@dl:~# cat /proc/net/sockstat
sockets: used 304
TCP: inuse 444 orphan 302 tw 152 alloc 451 mem 5280
UDP: inuse 4
RAW: inuse 0
FRAG: inuse 0 memory 0

root@dl:~# netstat -tan | awk '{print $6}' | sort | uniq -c
      2 CLOSE_WAIT
    121 CLOSING
      1 established)
    109 ESTABLISHED
     17 FIN_WAIT1
      9 FIN_WAIT2
      1 Foreign
    300 LAST_ACK
     20 LISTEN
      2 SYN_RECV
    433 TIME_WAIT

The server also doesn't seem to be out of file descriptors, but I'm not
100% sure of that. And even if it were, wouldn't that produce an error
rather than a hang?

It does seem to be somewhat related to the number of connections
lighttpd is serving. I can shut down lighttpd and the problem goes
away. That said, lighttpd and apache are able to coexist in this state
with no problem (apache never hangs). People can also connect to an
IRC server on the same machine with no problem during these
"episodes". So maybe it is limited to select()?

What resource, other than sockets or file descriptors, could lighttpd
be using up that causes select() to hang? I am pulling my hair out
over this.

I've tried all of the usual network tuning stuff (the various settings
through sysctl, reducing the timeouts), all with no effect. The
problem must be elsewhere.

Linux dl 2.6.18-6-486 #1 Sat Dec 27 08:57:46 UTC 2008 i686 GNU/Linux

It's running Debian Etch.

What might cause select() to hang while checking some sockets?


Thanks,


Nathan


More information about the openssh-unix-dev mailing list