question on scalability

Andrey Ermolinskiy andrey at us.ibm.com
Sat Nov 22 04:20:25 EST 2003


Hello All,

We have a Linux cluster application that uses OpenSSH as its inter-node
communication mechanism, and we've recently run into a problem that points
to a potential scalability issue in the OpenSSH code.

Our client nodes systematically open ssh connections to the server node to
execute an administrative command. While these connections are being
established, the server side sometimes fails to complete the TCP handshake
with some of the clients. The final ACK from the client node is sometimes
dropped by server-side TCP, and the corresponding connection is never added
to sshd's accept queue. This leaves the ssh client command hung, since it
has completed its part of the TCP handshake and is ready to exchange data
over the socket.

This problem reveals itself in situations where 64 or more client nodes
issue concurrent ssh requests to the server.

Looking at sshd.c, I noticed that the daemon's listen socket is created
with a very small backlog value (5), and we are certain that this is the
cause of our problem. Is there a reason for using such a small value, as
opposed to setting the backlog to SOMAXCONN?
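
To make the question concrete, here is a minimal sketch of the kind of
change being asked about. It is not the code from sshd.c: the helper name
open_listener and its parameters are hypothetical, and it binds to the
loopback address only for illustration. The only point is the second
argument to listen(), which is a hint for how many pending connections the
kernel should queue before accept() picks them up.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not taken from sshd.c: create a TCP listening
 * socket on the loopback address with the given backlog.
 * Returns the socket descriptor, or -1 on failure. */
int open_listener(unsigned short port, int backlog)
{
    struct sockaddr_in addr;
    int on = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    /* Allow the address to be reused across quick restarts. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);    /* port 0 lets the kernel choose */

    /* The backlog is only a hint; the kernel may clamp it to a
     * system-wide limit. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With this helper, open_listener(port, 5) corresponds to the current
behaviour described above, and open_listener(port, SOMAXCONN) to the
proposed one. SOMAXCONN comes from <sys/socket.h>, and its value is
system-dependent.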

We need to scale our application to clusters with thousands of nodes, and
we are trying to determine whether OpenSSH would let us meet these scaling
requirements. If increasing sshd's backlog has no negative implications, we
would like to see this value raised to SOMAXCONN. I think that such a
change would make OpenSSH a more reliable tool for clustered environments.

Any help or feedback from you would be appreciated.

Thanks,

- Andrey Ermolinskiy





More information about the openssh-unix-dev mailing list