Race condition when using ControlMaster=auto with simultaneous connections
Baptiste Jonglez
baptiste.jonglez at inria.fr
Wed Aug 31 23:24:12 AEST 2022
Hello,
I'm trying to multiplex many simultaneous SSH connections through a single
master connection, and I'm hitting a race condition while doing this.
This is not a bug; I'm either hitting a limit in the design of OpenSSH or
misusing it.
The use-case is configuring many hosts simultaneously with Ansible, where
all connections go through a single "SSH bastion" host via ProxyJump.
For efficiency and to avoid hitting MaxStartups limits, I would like to
use a control master for the connection to the bastion, via the following
client configuration:
Host bastion.example.com
    ControlMaster auto
    ControlPath /dev/shm/ssh-%h
    ControlPersist 30

Host !bastion.example.com *.example.com
    ProxyJump bastion.example.com
However, this does not work when making simultaneous connections: each SSH
process ends up opening its own, separate connection to the bastion. Here is
a simple way to reproduce:
$ for i in {1..3}; do ssh myhost.example.com "sleep 1" & done
ControlSocket /dev/shm/ssh-bastion.example.com already exists, disabling multiplexing
ControlSocket /dev/shm/ssh-bastion.example.com already exists, disabling multiplexing
What happens is the following:
1) each SSH process tries to connect to the control socket and fails
(this is expected, since the control socket is not yet bound)
2) each SSH process then creates a new SSH connection
3) once connected, each process tries to bind to the control socket
4a) one process successfully binds the control socket
4b) all other processes fail to bind the control socket (error message above)
5) in either case, each process ends up using its own separate SSH connection to the bastion
The window for the race condition is between 1) and 4), so it's rather
large: it includes the time to establish a new SSH connection.
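For what it's worth, I can avoid the race in this setup by pre-establishing
the master connection before starting the parallel clients, so that the
control socket is already bound when they look for it (hostnames as in the
reproducer above; this is only a workaround, not a fix):

$ ssh -O check bastion.example.com 2>/dev/null || ssh -f -N bastion.example.com
$ for i in {1..3}; do ssh myhost.example.com "sleep 1" & done

This requires wiring the bastion hostname into the calling script, though,
which is why a solution inside ssh itself seems preferable.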
I believe that taking a lock between steps 1) and 4) could solve the issue
(see the sketch after the steps below):
1.1) each process tries to take an exclusive lock related to the control socket
1.1a) one process gets the lock and can continue creating an SSH connection
1.1b) all other processes wait on the lock; when the lock is released, they
go back to step 1) to connect to the control socket
4.1) once the control socket has been bound, the "lucky process" releases the lock
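To illustrate the idea, here is a rough userspace approximation of that
locking scheme, using flock(1) as a wrapper around ssh; the lock file name is
made up for this example and is derived from the ControlPath in the
configuration above (a real implementation would of course live inside ssh
itself):

#!/bin/sh
# Approximation of the proposed locking scheme with flock(1), outside of ssh.
# The .lock file name is invented for this sketch.
sock=/dev/shm/ssh-bastion.example.com
(
    flock -x 9                                    # 1.1) take the exclusive lock
    ssh -O check bastion.example.com 2>/dev/null \
        || ssh -f -N bastion.example.com          # 2)-4a) establish the master, bind the socket
) 9>"$sock.lock"                                  # 4.1) lock is released when the subshell exits
exec ssh "$@"                                     # now multiplexes through the bound socket

Used as the SSH command that Ansible invokes, the first process to get the
lock creates the master and every other process then finds the socket already
bound, which is essentially what steps 1.1) to 4.1) would do natively.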
Does this make sense? Would the project accept a patch implementing this as
an additional option?
Thanks,
Baptiste
--
Baptiste Jonglez
Research Engineer, Inria <https://www.inria.fr/>
STACK team <https://stack-research-group.gitlabpages.inria.fr/web/>