errors when running multiple openssh sessions

jcduell at lbl.gov
Tue Jun 17 06:21:07 EST 2003


On Mon, Jun 16, 2003 at 01:08:59PM -0700, Dan Kaminsky wrote:
> jcduell at lbl.gov wrote:
>
> >   #!/bin/sh
> >
> >   for NUM in 0 1 2 3 4 5 6 7 8 9; do 
> >       ssh n2003 echo $NUM "$*" &
> >   done
> >
> >So, we're running 10 ssh commands at once.  
>
> That invocation creates quite a spike in CPU usage.  

Thanks for replying to my fragment of a post (I tried another test that
turned out to fork itself recursively, and I managed to accidentally
send my mail while trying to kill the resulting job horde:  talk about a
CPU spike ;)


> Who knows, it might also be causing headaches for privsep.  A better
> way to do what you describe above would be this:
> 
> #!/bin/sh
> 
> for NUM in 0 1 2 3 4 5 6 7 8 9; do
>        echo $NUM "$*" \&
> done | ssh n2003
> 
> From what I've found, this is the best way to execute sets of commands 
> remotely without significant CPU load.  

As my full post (below) notes, I'm writing a compiler that needs to use
ssh as part of compilation.  If a user calls my front end with the
equivalent of

    gcc foo.c bar.c

then my script will do only one ssh for both files, but most people
write makefiles that compile each file separately.

Here's my full post.  I'm creating a bug for this in Bugzilla, too,
since I don't see anything resembling it in the existing bugs.

------------------------------------------------------------------------
OpenSSH seems to fail sporadically if you issue lots of simultaneous
ssh commands.  Take the following program:

    #!/bin/sh

    for NUM in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
        ssh foo.bar.com echo $NUM &
    done

So, we're running 16 ssh commands at once, each of which just prints out a
different number.

When I run this program, several of the ssh commands fail with 

    ssh_exchange_identification: Connection closed by remote host

Interestingly, when I run 10 or fewer ssh commands, they all work OK, at
least on my Linux box (I'm using OpenSSH 3.5p1-6 on Red Hat Linux 9).  On 
some other platforms the number is different:  OpenSSH 3.2.3p1 on an IBM 
SP doesn't like more than 8 simultaneous ssh's in the background, while
OpenSSH_3.6.1p1 on Tru64 does around 9 max.

There doesn't seem to be any pattern in terms of which ssh's get 
killed--the first, second and third jobs (i.e., those that print 0, 1, and 
2) generally run OK, but which of the following ones die seems to be
random.

This smells like some kind of race condition.
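
To put a number on the failures rather than eyeballing the output, something
like the following could be used.  This is only a sketch: count_failures is a
hypothetical helper (not part of OpenSSH) that starts N copies of a command
in the background, appends the job index as the last argument, waits for each
one, and prints how many exited with a non-zero status.  It assumes a failed
connection shows up as a non-zero exit status from ssh.

```shell
#!/bin/sh
# Sketch: run N copies of a command in parallel and count the failures.
# count_failures is a hypothetical helper name, not an OpenSSH tool.

count_failures() {
    n=$1; shift
    pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        # Append the job index as the last argument; discard all output
        # so that only the final count is printed.
        "$@" "$i" >/dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done

    failed=0
    for pid in $pids; do
        # wait returns the exit status of that particular background job.
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed"
}
```

For example, `count_failures 16 ssh foo.bar.com echo` would run the same 16
sessions as the script above and print how many of them failed.  Repeating it
with different values of N should show where the threshold lies on a given
platform.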

Why on earth would I want to run a dozen ssh jobs simultaneously?  I'm
writing a compiler that needs to ship some files and run some commands 
on a remote server as part of the compilation process.  The latency for
doing this is rather high, so I want to allow users to do a 'make -j' to
parallelize the build, in order to hide the network latency.

I guess for now I'll tell users to run 'make -j N' with N < 6 or so
(which is probably not a bad idea anyway).  But I can't imagine I'll be
the last person to pound on ssh like this...
------------------------------------------------------------------------
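
As a stopgap alongside limiting 'make -j', a front end could also retry a
connection that fails.  The following is a rough sketch, not a fix for the
underlying problem: retry_ssh is a hypothetical wrapper, and the attempt
count and delay are arbitrary.  It relies on ssh exiting with status 255 when
ssh itself hits an error, so only that status is retried; a remote command
that itself exits with 255 would be retried too, which may not be wanted.

```shell
#!/bin/sh
# Sketch: retry a command when it exits with 255 (ssh's own error status).
# retry_ssh is a hypothetical helper; tries and delay are arbitrary choices.

retry_ssh() {
    tries=$1; delay=$2; shift 2
    attempt=1
    while :; do
        "$@"
        status=$?
        # Any status other than 255 is taken to be the remote command's own
        # result and is passed through; also give up after the last attempt.
        if [ "$status" -ne 255 ] || [ "$attempt" -ge "$tries" ]; then
            return "$status"
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

Usage would look like `retry_ssh 3 1 ssh foo.bar.com echo hello`, which tries
up to three times with a one-second pause between attempts.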

Cheers,

-- 
Jason Duell             Future Technologies Group
<jcduell at lbl.gov>       Computational Research Division
Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory




More information about the openssh-unix-dev mailing list