[Bug 1363] New: sshd gets stuck: select() in packet_read_seqnr waits indefinitely

Mon Sep 17 13:02:31 EST 2007

http://bugzilla.mindrot.org/show_bug.cgi?id=1363

           Summary: sshd gets stuck: select() in packet_read_seqnr waits
                    indefinitely
           Product: Portable OpenSSH
           Version: 4.2p1
          Platform: All
               URL: http://marc.info/?t=117394251600035
        OS/Version: All
            Status: NEW
          Keywords: patch
          Severity: major
          Priority: P2
         Component: sshd
        AssignedTo: bitbucket at mindrot.org
        ReportedBy: openssh at fjarlq.com

Created an attachment (id=1348)
 --> (http://bugzilla.mindrot.org/attachment.cgi?id=1348)
latest version of fix -- this has been tested

This bug was discussed on openssh-unix-dev in March 2007:
  http://marc.info/?t=117394251600035

During the discussion, Darren Tucker created a fix for the problem and
I (Matt Day) revised and tested it. The latest version of the patch is
attached.

Original problem report:

I'm having a problem where sshd login sessions are occasionally
(as often as once a day) getting stuck indefinitely. I enabled debug
messages and got a backtrace of a stuck sshd, and I think I've found
the bug.

sshd version:
    OpenSSH_4.2p1 FreeBSD-20050903, OpenSSL 0.9.7e-p1 25 Oct 2004

Uncommented lines (i.e., non-default settings) in sshd_config:
    LogLevel DEBUG
    ClientAliveInterval 90
    Subsystem sftp /usr/libexec/sftp-server

SSH client:
    PuTTY version 0.58, default settings

OS/HW:
    FreeBSD 6.1-RELEASE running on 64-bit x86 ("amd64" platform)

Executive summary:
    The select() in packet_read_seqnr() waits indefinitely, resulting
    in stuck SSH sessions when networking problems interfere with
    key exchange. I would like to be able to set a timeout there, or
    to have SSH keepalives sent during key exchange.

Periodically (every 60 minutes) the SSH client initiates rekeying
via key exchange. Here's an example of a successful rekeying:

Mar 11 19:02:35 SSH2_MSG_KEXINIT received
Mar 11 19:02:35 SSH2_MSG_KEXINIT sent
Mar 11 19:02:35 kex: client->server aes256-ctr hmac-sha1 none
Mar 11 19:02:35 kex: server->client aes256-ctr hmac-sha1 none
Mar 11 19:02:35 SSH2_MSG_KEX_DH_GEX_REQUEST_OLD received
Mar 11 19:02:35 SSH2_MSG_KEX_DH_GEX_GROUP sent
Mar 11 19:02:35 expecting SSH2_MSG_KEX_DH_GEX_INIT
Mar 11 19:02:38 SSH2_MSG_KEX_DH_GEX_REPLY sent
Mar 11 19:02:38 set_newkeys: rekeying
Mar 11 19:02:38 SSH2_MSG_NEWKEYS sent
Mar 11 19:02:38 expecting SSH2_MSG_NEWKEYS
Mar 11 19:02:38 set_newkeys: rekeying
Mar 11 19:02:38 SSH2_MSG_NEWKEYS received

In the failure case, sshd gets stuck during key exchange. The SSH
session had been going fine for many hours, and then these were the
last messages it logged:

Mar 11 20:02:38 SSH2_MSG_KEXINIT received
Mar 11 20:02:38 SSH2_MSG_KEXINIT sent
Mar 11 20:02:38 kex: client->server aes256-ctr hmac-sha1 none
Mar 11 20:02:38 kex: server->client aes256-ctr hmac-sha1 none
Mar 11 20:02:38 SSH2_MSG_KEX_DH_GEX_REQUEST_OLD received
Mar 11 20:02:38 SSH2_MSG_KEX_DH_GEX_GROUP sent
Mar 11 20:02:38 expecting SSH2_MSG_KEX_DH_GEX_INIT

The user was idle when this happened, but had a program running
that was generating output. That program became tty-blocked after
about 30 minutes, presumably because sshd wasn't draining its output,
and that's when I noticed the user's sshd was stuck and got a
backtrace:

(gdb) where
#0  0x.. in select () from /lib/libc.so.6
#1  0x.. in packet_read_seqnr () from /usr/lib/libssh.so.3
#2  0x.. in packet_read () from /usr/lib/libssh.so.3
#3  0x.. in packet_read_expect () from /usr/lib/libssh.so.3
#4  0x.. in kexgex_server (kex=0x538900) at kexgexs.c:99
#5  0x.. in kex_setup () from /usr/lib/libssh.so.3
#6  0x.. in kex_input_kexinit () from /usr/lib/libssh.so.3
#7  0x.. in dispatch_run () from /usr/lib/libssh.so.3
#8  0x.. in process_buffered_input_packets () at serverloop.c:475
#9  0x.. in server_loop2 (authctxt=0x4) at serverloop.c:760
#10 0x.. in do_authenticated2 (authctxt=0x4) at session.c:2456
#11 0x.. in do_authenticated (authctxt=0x53a400) at session.c:227
#12 0x.. in main at sshd.c:1749

This backtrace agrees with the debug messages: it's in kexgex_server(),
calling packet_read_expect(SSH2_MSG_KEX_DH_GEX_INIT), which ultimately
calls select() from packet_read_seqnr().

The select() call in packet_read_seqnr() passes NULL for the timeout,
meaning it will wait forever. That explains why the comment above
it says "Note that no other data is processed until this returns,
so this function should not be used during the interactive session."
But this was an interactive session.
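
To illustrate the difference a timeout makes, here is a minimal
sketch. It is not the attached patch and not OpenSSH code: the name
read_with_timeout() and its millisecond parameter are made up for
this example. A negative timeout reproduces the current behaviour
(NULL passed to select(), so it waits forever); a non-negative one
makes the call give up with ETIMEDOUT.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not OpenSSH code): read from fd, but give up
 * after timeout_ms milliseconds. A negative timeout_ms passes NULL to
 * select(), i.e. wait forever. Returns the number of bytes read, 0 on
 * EOF, or -1 with errno set (ETIMEDOUT if the wait expired).
 */
ssize_t
read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
	fd_set rset;
	struct timeval tv, *tvp = NULL;
	int r;

	assert(fd >= 0 && fd < FD_SETSIZE);

	if (timeout_ms >= 0) {
		tv.tv_sec = timeout_ms / 1000;
		tv.tv_usec = (timeout_ms % 1000) * 1000;
		tvp = &tv;
	}
	FD_ZERO(&rset);
	FD_SET(fd, &rset);

	r = select(fd + 1, &rset, NULL, NULL, tvp);
	if (r == 0) {
		/* Nothing arrived in time: report it instead of hanging. */
		errno = ETIMEDOUT;
		return -1;
	}
	if (r == -1)
		return -1;	/* errno set by select(), e.g. EINTR */
	return read(fd, buf, len);
}
```

With something like this, a caller waiting for
SSH2_MSG_KEX_DH_GEX_INIT could log an error and close the session
when -1/ETIMEDOUT comes back, rather than blocking indefinitely.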

I've set ClientAliveInterval in sshd_config so that SSH sessions
die in a timely manner when networking problems arise, but the
keepalive is apparently not sent during key exchange. The default
TCP keepalive on FreeBSD is unhelpful here; it only kicks in after
2 hours, and I need stuck SSH sessions to die a lot sooner. I want
to keep the FreeBSD TCP keepalive defaults.

Would it be possible for the select() in packet_read_seqnr to use
an optional timeout? Similarly, I believe the select() in
packet_write_wait has the same problem. Upon timeout, it would be
fine with me if the session died with an error logged. Alternatively,
if SSH keepalives were sent during key exchange, that would suffice.
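
For the write side, the same idea might look roughly like the sketch
below. Again this is only an illustration under my own assumptions,
not the attached patch: make_nonblocking() and
write_all_with_timeout() are hypothetical names, the timeout applies
to each individual wait rather than to the whole write, and the
descriptor is assumed to be in non-blocking mode.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode. 0 on success. */
int
make_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags == -1)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Hypothetical helper (not OpenSSH code): write all len bytes to a
 * non-blocking fd, waiting at most timeout_ms milliseconds each time
 * the descriptor is not writable. Returns 0 on success, or -1 with
 * errno set (ETIMEDOUT if a wait expired).
 */
int
write_all_with_timeout(int fd, const void *buf, size_t len, int timeout_ms)
{
	const char *p = buf;
	fd_set wset;
	struct timeval tv;
	ssize_t n;
	int r;

	assert(fd >= 0 && fd < FD_SETSIZE && timeout_ms >= 0);

	while (len > 0) {
		/* select() may modify both arguments, so reset them each pass. */
		FD_ZERO(&wset);
		FD_SET(fd, &wset);
		tv.tv_sec = timeout_ms / 1000;
		tv.tv_usec = (timeout_ms % 1000) * 1000;

		r = select(fd + 1, NULL, &wset, NULL, &tv);
		if (r == 0) {
			errno = ETIMEDOUT;
			return -1;
		}
		if (r == -1) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		n = write(fd, p, len);
		if (n == -1) {
			if (errno == EAGAIN || errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}
```

If the peer stops draining the connection, the select() for
writability eventually times out and the caller gets -1/ETIMEDOUT, at
which point the session could be closed with an error logged.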
