Failure to Launch (was override -q option)

Laurence Marks L-marks at northwestern.edu
Mon Jul 29 21:57:51 EST 2013


A brief update. The problem was not MaxStartups, which was set to 1000.
As a minor clue (perhaps), I tried hard to reproduce the problem by
repeatedly running a single 64-core MPI task, but could not. Since
I was running ~10 jobs at the same time when I had this problem (all
using 64 cores and launching 8 MPI tasks every 5-15 minutes by ssh),
this strongly suggests that it is "somehow" congestion related. Wild
guesses would include not being able to read or access an ssh-related
file because it is currently open, or something else.
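
For anyone who has to stay with ssh, one generic way to ride out this
kind of congestion is to retry a failed launch after a randomized
delay, so that jobs starting at the same moment do not all retry in
lockstep. What follows is only a minimal sketch in Python, not the patch
I ended up with and not tested on the cluster in question. The
function name, the retry count and the delays are illustrative
assumptions, and node01 is a hypothetical host. It relies on the
documented behaviour that ssh exits with status 255 when ssh itself
fails, and otherwise returns the exit status of the remote command.

```python
import random
import subprocess
import time


def run_with_retry(cmd, attempts=5, base_delay=1.0,
                   run=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying with jittered exponential backoff.

    Only exit status 255 (ssh's own failure code) triggers a retry;
    any other status is treated as the remote command's result and
    returned immediately. `run` and `sleep` are parameters so they
    can be swapped out when testing without a real ssh server.
    """
    for attempt in range(attempts):
        result = run(cmd)
        if result.returncode != 255:
            return result.returncode
        if attempt < attempts - 1:
            # Jitter spreads out retries from launchers that failed together.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)
    return 255


# Hypothetical usage (node01 is a placeholder host name):
#   status = run_with_retry(["ssh", "node01", "hostname"])
```

A remote command that itself exits with 255 would also be retried,
which is a limitation of keying on the exit status alone.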

Since it is not my cluster (and my personal one is too small to test
this on) and I cannot get access to log file information (or much
else), I've given up on tracking it down. I am now using openmpi/mpirun
instead of ssh to create the remote connections, and this seems to be
100% reliable (on the cluster I am using).

Thanks for your (collective) help, which gave me the clues to the
patch I have ended up with.

On Mon, Jul 22, 2013 at 8:41 AM, Gert Doering <gert at greenie.muc.de> wrote:
> Hi,
>
> On Mon, Jul 22, 2013 at 08:11:40AM -0500, Laurence Marks wrote:
>> It may be that I will need to run a number of similar jobs in parallel
>> which is tricky to setup reliably with a queueing system. One
>> question: is there any conceivable way if 10-20 tasks are all trying
>> to connect via ssh at the same time that there can be an issue? They
>
> There is a limit on the number of yet-unauthenticated ssh sessions that
> the server permits.
>
> sshd_config
>
>      MaxStartups
>              Specifies the maximum number of concurrent unauthenticated con-
>              nections to the SSH daemon.  Additional connections will be
>              dropped until authentication succeeds or the LoginGraceTime
>              expires for a connection.  The default is 10.
>
> this *might* be hitting you...
>
> gert
> --
> USENET is *not* the non-clickable part of WWW!
>                                                            //www.muc.de/~gert/
> Gert Doering - Munich, Germany                             gert at greenie.muc.de
> fax: +49-89-35655025                        gert at net.informatik.tu-muenchen.de



-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what
nobody else has thought"
Albert Szent-Gyorgyi


More information about the openssh-unix-dev mailing list