Race condition when using ControlMaster=auto with simultaneous connections

Wed Aug 31 23:24:12 AEST 2022

Hello,

I'm trying to multiplex many simultaneous SSH connections through a single
master connection, and I'm hitting a race condition while doing this.
This is not a bug; I'm either hitting a limit in the design of OpenSSH or
misusing it.

The use-case is to use Ansible to configure many hosts simultaneously,
while all connections need to go through a single "SSH bastion" via ProxyJump.
For efficiency and to avoid hitting MaxStartups limits, I would like to
use a control master for the connection to the bastion, via the following
client configuration:

    Host bastion.example.com
      ControlMaster auto
      ControlPath /dev/shm/ssh-%h
      ControlPersist 30

    Host !bastion.example.com *.example.com
      ProxyJump bastion.example.com

However, this does not work when the connections are started simultaneously:
each SSH process opens its own separate connection to the bastion.  Here is
a simple way to reproduce:

    $ for i in {1..3}; do ssh myhost.example.com "sleep 1" & done
    ControlSocket /dev/shm/ssh-bastion.example.com already exists, disabling multiplexing
    ControlSocket /dev/shm/ssh-bastion.example.com already exists, disabling multiplexing

What happens is the following:

1) each SSH process tries to connect to the control socket and fails
   (this is expected, since the control socket is not yet bound)

2) each SSH process then creates a new SSH connection

3) once connected, each process tries to bind to the control socket

4a) one process successfully binds the control socket
4b) all other processes fail to bind the control socket (error message above)

5) in both cases, each process is now using its own separate SSH connection to the bastion
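
The sequence above can be reproduced without any SSH involved.  The
following is only a toy model: a directory stands in for the control socket
(mkdir, like bind(2) on a Unix socket path, fails if the name already
exists), `sleep 1` stands in for the time needed to connect to the bastion,
and the names `client` and `simulate_race` are made up for this sketch.

```shell
#!/bin/sh
# Toy model of steps 1) to 5). Nothing here talks to OpenSSH.

client() {
    sock=$1
    id=$2
    # Step 1: look for an existing master; at startup there is none.
    if [ -e "$sock" ]; then
        echo "client $id: reusing existing master"
        return 0
    fi
    # Step 2: "connect" to the bastion. This is the slow part, and it is
    # where the race window opens.
    sleep 1
    # Steps 3 and 4: try to bind the control socket; only one client can win.
    if mkdir "$sock" 2>/dev/null; then
        echo "client $id: bound control socket"
    else
        echo "client $id: ControlSocket already exists, disabling multiplexing"
    fi
    # Step 5: either way, this client now holds its own connection.
}

simulate_race() {
    dir=$(mktemp -d)
    for i in 1 2 3; do
        client "$dir/ssh-bastion" "$i" &
    done
    wait
    rm -rf "$dir"
}

simulate_race
```

With three clients started together, one of them should win the bind and
the other two should report the same error as in the real reproduction.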

The window for the race condition is between 1) and 4), so it's rather
large: it includes the time to establish a new SSH connection.

I believe that taking a lock between steps 1) and 4) could solve the issue:

1.1) each process tries to take an exclusive lock related to the control socket
1.1a) one process gets the lock and can continue creating an SSH connection
1.1b) all other processes wait on the lock; when the lock is released, they
      go back to step 1) to connect to the control socket

4.1) once the control socket has been bound, the "lucky process" releases the lock
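
To illustrate the locking scheme, here is a rough sketch of the same idea
done outside of OpenSSH, as a wrapper that brings up the master before the
parallel connections start.  It is not the proposed patch.  It assumes that
flock(1) from util-linux is available and that the control socket is
already bound when `ssh -f` returns; the function name `ensure_master`, the
lock file path and the `SSH` override variable are made up for this example.

```shell
#!/bin/sh
# Sketch only: serialize master creation with an exclusive lock.
# Usage: ensure_master HOST [LOCKFILE]

ensure_master() (
    host=$1
    lock=${2:-/dev/shm/ssh-$host.lock}   # hypothetical lock file location
    # Step 1.1: take an exclusive lock on fd 9; other callers block here.
    exec 9>"$lock"
    flock -x 9
    # Step 1, done under the lock: is a master already listening?
    if ! ${SSH:-ssh} -O check "$host" >/dev/null 2>&1; then
        # Steps 2 to 4: start a background master that binds the socket.
        # fd 9 is closed for the child so that the long-lived master does
        # not inherit the lock and keep it held.
        ${SSH:-ssh} -o ControlMaster=yes -N -f "$host" 9>&-
    fi
    # Step 4.1: the lock is released when this subshell exits.
)

# Example (not run here):
#   ensure_master bastion.example.com &&
#       for i in 1 2 3; do ssh myhost.example.com "sleep 1" & done
```

Because the check is repeated while holding the lock, a caller that waited
in step 1.1b) finds the master started by the first caller and does not
start another one.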

Does this make sense?  Would the project accept a patch implementing this
as an additional option?

Thanks,
Baptiste

-- 
Baptiste Jonglez
Research Engineer, Inria <https://www.inria.fr/>
STACK team <https://stack-research-group.gitlabpages.inria.fr/web/>