sshd gets stuck: select() in packet_read_seqnr waits indefinitely

Matt Day opensshbugs at fjarlq.com
Thu Mar 15 05:53:09 EST 2007


Dear OpenSSH Portable sshd developers,

I'm having a problem where sshd login sessions occasionally (as
often as once a day) get stuck indefinitely. I enabled debug
messages and got a backtrace of a stuck sshd, and I think I've found
the bug. I wanted to run it by the list before filing a bug report.

sshd version:
    OpenSSH_4.2p1 FreeBSD-20050903, OpenSSL 0.9.7e-p1 25 Oct 2004

Uncommented lines (i.e., non-default settings) in sshd_config:
    LogLevel DEBUG
    ClientAliveInterval 90
    Subsystem sftp /usr/libexec/sftp-server

SSH client:
    PuTTY version 0.58, default settings

OS/HW:
    FreeBSD 6.1-RELEASE running on 64-bit x86 ("amd64" platform)

Executive summary:
    The select() in packet_read_seqnr() waits indefinitely, resulting
    in stuck SSH sessions when networking problems interfere with
    key exchange. I would like to be able to set a timeout there, or
    to have sshd send SSH keepalives during key exchange.

Periodically (every 60 minutes) the SSH client initiates rekeying
via key exchange. Here's an example of a successful rekeying:

Mar 11 19:02:35 SSH2_MSG_KEXINIT received
Mar 11 19:02:35 SSH2_MSG_KEXINIT sent
Mar 11 19:02:35 kex: client->server aes256-ctr hmac-sha1 none
Mar 11 19:02:35 kex: server->client aes256-ctr hmac-sha1 none
Mar 11 19:02:35 SSH2_MSG_KEX_DH_GEX_REQUEST_OLD received
Mar 11 19:02:35 SSH2_MSG_KEX_DH_GEX_GROUP sent
Mar 11 19:02:35 expecting SSH2_MSG_KEX_DH_GEX_INIT
Mar 11 19:02:38 SSH2_MSG_KEX_DH_GEX_REPLY sent
Mar 11 19:02:38 set_newkeys: rekeying
Mar 11 19:02:38 SSH2_MSG_NEWKEYS sent
Mar 11 19:02:38 expecting SSH2_MSG_NEWKEYS
Mar 11 19:02:38 set_newkeys: rekeying
Mar 11 19:02:38 SSH2_MSG_NEWKEYS received

In the failure case, sshd gets stuck during key exchange. The SSH
session had been going fine for many hours, and then these were the
last messages it logged:

Mar 11 20:02:38 SSH2_MSG_KEXINIT received
Mar 11 20:02:38 SSH2_MSG_KEXINIT sent
Mar 11 20:02:38 kex: client->server aes256-ctr hmac-sha1 none
Mar 11 20:02:38 kex: server->client aes256-ctr hmac-sha1 none
Mar 11 20:02:38 SSH2_MSG_KEX_DH_GEX_REQUEST_OLD received
Mar 11 20:02:38 SSH2_MSG_KEX_DH_GEX_GROUP sent
Mar 11 20:02:38 expecting SSH2_MSG_KEX_DH_GEX_INIT

The user was idle when this happened, but had a program running
that was generating output. That program became tty-blocked after
about 30 minutes, presumably because sshd wasn't draining its output,
and that's when I noticed the user's sshd was stuck and got a backtrace:

(gdb) where
#0  0x.. in select () from /lib/libc.so.6
#1  0x.. in packet_read_seqnr () from /usr/lib/libssh.so.3
#2  0x.. in packet_read () from /usr/lib/libssh.so.3
#3  0x.. in packet_read_expect () from /usr/lib/libssh.so.3
#4  0x.. in kexgex_server (kex=0x538900) at kexgexs.c:99
#5  0x.. in kex_setup () from /usr/lib/libssh.so.3
#6  0x.. in kex_input_kexinit () from /usr/lib/libssh.so.3
#7  0x.. in dispatch_run () from /usr/lib/libssh.so.3
#8  0x.. in process_buffered_input_packets () at serverloop.c:475
#9  0x.. in server_loop2 (authctxt=0x4) at serverloop.c:760
#10 0x.. in do_authenticated2 (authctxt=0x4) at session.c:2456
#11 0x.. in do_authenticated (authctxt=0x53a400) at session.c:227
#12 0x.. in main at sshd.c:1749

This backtrace agrees with the debug messages: it's in kexgex_server(),
calling packet_read_expect(SSH2_MSG_KEX_DH_GEX_INIT), which ultimately
calls select() from packet_read_seqnr().

The select() call in packet_read_seqnr() passes NULL as the timeout,
meaning it will wait forever. That explains why the comment above
it says "Note that no other data is processed until this returns,
so this function should not be used during the interactive session."
But this was an interactive session.

I've set ClientAliveInterval in sshd_config so that SSH sessions
die in a timely manner when networking problems arise, but the
keepalive is apparently not sent during key exchange. The default
TCP keepalive on FreeBSD is unhelpful here; it only kicks in after
2 hours, and I need stuck SSH sessions to die a lot sooner. I want
to keep the FreeBSD TCP keepalive defaults.

Would it be possible for the select() in packet_read_seqnr to use
an optional timeout? Similarly, I believe the select() in
packet_write_wait has the same problem. Upon timeout, it would be
fine with me if the session died with an error logged. Alternatively,
if SSH keepalives were sent during key exchange, that would suffice.
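
A minimal sketch of what such a bounded wait could look like, for
illustration only. This is not OpenSSH code: wait_readable(), idle_cb
and count_idle() are hypothetical names, and the millisecond interval
is an assumption chosen to make the sketch easy to try. It calls
select() with a timeout, invokes a callback after each silent interval
(the point where a keepalive could be sent), and gives up after a
fixed number of consecutive silent intervals.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical hook: called after each interval that passes without data. */
typedef void (*idle_cb)(void *ctx);

/*
 * Wait until fd becomes readable, but never wait forever.
 * Returns 1 if fd is readable, 0 if max_idle consecutive intervals
 * passed with no data, and -1 if select() failed.
 */
int
wait_readable(int fd, int interval_ms, int max_idle, idle_cb on_idle, void *ctx)
{
	int idle = 0;

	for (;;) {
		fd_set readset;
		struct timeval tv;
		int r;

		/* select() may modify both arguments, so rebuild them each pass. */
		FD_ZERO(&readset);
		FD_SET(fd, &readset);
		tv.tv_sec = interval_ms / 1000;
		tv.tv_usec = (interval_ms % 1000) * 1000;

		r = select(fd + 1, &readset, NULL, NULL, &tv);
		if (r > 0)
			return 1;		/* data is waiting */
		if (r == -1) {
			if (errno == EINTR || errno == EAGAIN)
				continue;	/* interrupted: restart this interval */
			return -1;
		}
		/* r == 0: the interval elapsed with nothing to read. */
		if (++idle >= max_idle)
			return 0;		/* caller can log an error and close */
		if (on_idle != NULL)
			on_idle(ctx);		/* e.g. send a keepalive here */
	}
}

/* Stand-in for a keepalive sender: just counts how often it was called. */
void
count_idle(void *ctx)
{
	++*(int *)ctx;
}
```

With max_idle set to 3, the callback runs after the first and second
silent intervals and the function returns 0 after the third, so the
caller gets control back instead of blocking indefinitely.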

Thanks for reading, and thanks for making OpenSSH. I've been a happy
user for many years now!

Matt Day <opensshbugs at fjarlq.com>

