ssh client does not time out if the network fails after ssh_connect but before ssh_exchange_identification, even with Alive options set

Thu Jul 26 08:12:24 EST 2007

Hello again,

Here is the patch I came up with to prevent the hang in
ssh_exchange_identification. I have tested it lightly and it seems to solve
the problem. Could someone take a look at the patch? Thanks a lot!

--- sshconnect.c~old    2007-07-25 10:44:26.000000000 -0700
+++ sshconnect.c        2007-07-25 14:45:57.000000000 -0700
@@ -404,9 +404,26 @@ ssh_exchange_identification(void)
        int minor1 = PROTOCOL_MINOR_1;
        u_int i, n;

+       if (options.server_alive_interval) {
+               fd_set rfds;
+               struct timeval timeo = { .tv_usec=0 };
+               int read_timeouts, ret;
+
+               for (read_timeouts = 0;;) {
+                       FD_ZERO(&rfds); /* select() may modify the set */
+                       FD_SET(connection_in, &rfds);
+                       timeo.tv_sec = options.server_alive_interval;
+                       ret = select(connection_in+1, &rfds, NULL, NULL, &timeo);
+                       if (ret < 0) {
+                               fatal("ssh_exchange_identification: select read error: %.100s", strerror(errno));
+                       } else if (ret == 0) {
+                               if (++read_timeouts >= options.server_alive_count_max)
+                                       fatal("ssh_exchange_identification: Timeout, server not responding");
+                       } else
+                               break;
+               }
+       }
        /* Read other side's version identification. */
-       struct timeval timeo = { .tv_sec=10, .tv_usec=0 };
-       setsockopt(connection_in, SOL_SOCKET, SO_SNDTIMEO, &timeo, sizeof(timeo));
        for (n = 0;;) {
                for (i = 0; i < sizeof(buf) - 1; i++) {
                        size_t len = atomicio(read, connection_in, &buf[i], 1);
@@ -490,6 +507,25 @@ ssh_exchange_identification(void)
            compat20 ? PROTOCOL_MAJOR_2 : PROTOCOL_MAJOR_1,
            compat20 ? PROTOCOL_MINOR_2 : minor1,
            SSH_VERSION);
+       if (options.server_alive_interval) {
+               fd_set wfds;
+               struct timeval timeo = { .tv_usec=0 };
+               int write_timeouts, ret;
+
+               for (write_timeouts = 0;;) {
+                       FD_ZERO(&wfds); /* select() may modify the set */
+                       FD_SET(connection_out, &wfds);
+                       timeo.tv_sec = options.server_alive_interval;
+                       ret = select(connection_out+1, NULL, &wfds, NULL, &timeo);
+                       if (ret < 0) {
+                               fatal("ssh_exchange_identification: select write error: %.100s", strerror(errno));
+                       } else if (ret == 0) {
+                               if (++write_timeouts >= options.server_alive_count_max)
+                                       fatal("ssh_exchange_identification: Timeout, server not responding");
+                       } else
+                               break;
+               }
+       }
        if (atomicio(vwrite, connection_out, buf, strlen(buf)) != strlen(buf))
                fatal("write: %.100s", strerror(errno));
        client_version_string = xstrdup(buf);


Jiaying

On 7/24/07, Jiaying Zhang <jiayingz at google.com> wrote:
>
> Hello,
>
> I have been testing ssh with occasional network disconnections between the
> server and the client. I found that ssh sometimes hangs if the disconnection
> happens after the connection is established but before
> ssh_exchange_identification completes. The ssh configuration files show that
> both client and server alive options are set.
> In /etc/ssh/ssh_config:
> # Send keepalive messages to the server. Disconnect after 90 seconds.
>   ServerAliveInterval 30
>   ServerAliveCountMax 3
> In /etc/ssh/sshd_config:
> # ClientAlive is more flexible and secure than TCPKeepAlive. (ssh2)
> # Send an alive messages every 30 seconds, and disconnect after 90 seconds.
> ClientAliveInterval 30
> ClientAliveCountMax 3
>
> The ssh client kept hanging even after the network was restored. It finally
> timed out after about 2 hours because tcp_keepalive_time is set to 2 hours
> in sysctl.
> I looked at the ssh code downloaded from your website and found that the
> Alive options are only used to set up timeouts after ssh_session starts. So
> my question is: why do we not start monitoring the liveness of the ssh server
> right after a connection is established? It is annoying when an application
> relies on ssh to do periodic work but an occasional network failure causes
> the application to miss several service cycles because ssh hangs.
>
> Thanks a lot!
>
> Jiaying
>
>