how to troubleshoot ssh multiplexing hanging issues?

Ken Chang z4242814375 at gmail.com
Wed Sep 13 04:32:35 AEST 2017


Hello ssh list,

Long-time user of OpenSSH, but relatively new to the concept of ssh
multiplexing. I'm experiencing some issues and I haven't figured out how to
troubleshoot them yet. I would appreciate some help if possible.

I'm using ssh as a communication mechanism to pass text-file-based
messages between two hosts. There are programs on each side that send and
receive these messages. When I found out about ssh multiplexing, I was
excited to use it because we were seeing several hundred ssh connections
going back and forth between the two hosts. When I tried ssh multiplexing,
message latency dropped by a factor of ten. However, now that this
mechanism has been in use for a week, I'm starting to see some problems.

First, these are the contents of .ssh/config:

Host *
  ControlPath ~/.ssh/cm-%r@%h:%p
  ControlMaster auto
  ControlPersist 10m


Everything seems to work for a few days, but then ssh starts to hang, and
we start seeing several hundred ssh processes all trying to send their
messages but unable to. When I run ssh by hand, this is what I get:

$ ssh -vvv boss@ui1
OpenSSH_6.6.1, OpenSSL 1.0.1e-fips 11 Feb 2013
debug1: Reading configuration data /var/lib/worker/.ssh/config
debug1: /var/lib/worker/.ssh/config line 1: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 56: Applying options for *
debug1: auto-mux: Trying existing master

And it hangs at that point indefinitely until Ctrl-C.

At this point in time, we do see the ssh mux process still running:

$ ps -eo pid,user,args | awk '$2=="worker" && $3=="ssh:" && $5=="[mux]" {print}'
29305 worker   ssh: /var/lib/worker/.ssh/cm-boss@ui1:22 [mux]

I attached strace to the ssh mux process, and this is what I see
while the problem is happening:

select(1024, [3 5 9], [], NULL, {0, 11336}) = 0 (Timeout)
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778030739}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778085461}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778109973}) = 0
select(1024, [3 4 5 9], [], NULL, NULL) = 1 (in [4])
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778186890}) = 0
accept(4, 0x7ffe26b34360, [128])        = -1 EMFILE (Too many open files)
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778263743}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778298340}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873813, 778343707}) = 0
select(1024, [3 5 9], [], NULL, {1, 0}) = 0 (Timeout)
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778457543}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778518096}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778546349}) = 0
select(1024, [3 4 5 9], [], NULL, NULL) = 1 (in [4])
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778627517}) = 0
accept(4, 0x7ffe26b34360, [128])        = -1 EMFILE (Too many open files)
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778693493}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778725395}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873814, 778749417}) = 0
select(1024, [3 5 9], [], NULL, {1, 0}) = 0 (Timeout)
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 778904087}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 778963540}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 778988943}) = 0
select(1024, [3 4 5 9], [], NULL, NULL) = 1 (in [4])
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 779072887}) = 0
accept(4, 0x7ffe26b34360, [128])        = -1 EMFILE (Too many open files)
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 779158255}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 779191597}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873815, 779216201}) = 0
select(1024, [3 5 9], [], NULL, {1, 0}) = 0 (Timeout)
clock_gettime(0x7 /* CLOCK_??? */, {17873816, 779334945}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873816, 779393178}) = 0
clock_gettime(0x7 /* CLOCK_??? */, {17873816, 779418473}) = 0
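
Since accept() keeps failing with EMFILE, one thing that could be checked is
how many descriptors the mux process itself is holding. The following is only
a rough sketch, assuming a Linux /proc filesystem; count_fds is a made-up
helper name, not part of OpenSSH.

```shell
#!/bin/sh
# count_fds PID: print how many file descriptors PID currently has open.
# Assumption: Linux, where /proc/PID/fd holds one entry per open descriptor.
# count_fds is a hypothetical helper name used only for illustration.
count_fds() {
    pid=$1
    if [ ! -d "/proc/$pid/fd" ]; then
        echo "no such process (or no /proc): $pid" >&2
        return 1
    fi
    # ls -1 prints one entry per line, so the line count is the fd count.
    ls -1 "/proc/$pid/fd" | wc -l | tr -d ' '
}
```

Running count_fds 29305 (the mux PID from the ps output above) as the worker
user or as root would print the current count, which can then be compared
with the open files limit.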

Does this indicate that the open-file limit for this user has been reached?
Or is this something else? This is the ulimit -a output for that user:

-bash-4.2$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2062375
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
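
The ulimit output above is for an interactive shell, and the limit that
actually applies to the already running mux process may differ. A minimal
sketch for reading the limit of a specific process, again assuming Linux and
/proc; soft_nofile is a hypothetical helper name.

```shell
#!/bin/sh
# soft_nofile PID: print the soft "Max open files" limit that applies to PID.
# Assumption: Linux, where /proc/PID/limits lists per-process resource limits.
# soft_nofile is a hypothetical helper name used only for illustration.
soft_nofile() {
    pid=$1
    limits="/proc/$pid/limits"
    if [ ! -r "$limits" ]; then
        echo "cannot read $limits" >&2
        return 1
    fi
    # The line has the form: "Max open files  <soft>  <hard>  files",
    # so the soft limit is the fourth whitespace-separated field.
    awk '/^Max open files/ {print $4}' "$limits"
}
```

For example, soft_nofile 29305 would print the soft limit of the mux process
from the ps output above, provided it is run by a user allowed to read that
file.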

Any advice on how to troubleshoot this further? Thanks in advance...


More information about the openssh-unix-dev mailing list