Instrumentation for metrics
Craig Miskell
cmiskell at gitlab.com
Tue Jan 21 11:59:57 AEDT 2020
Hi,
We serve a fairly substantial number[1] of ssh connections across our
fleet. We have hit the MaxStartups limit in the past and raised it a
few times (currently 300), but we get no warning before the limit is
reached and connections start being dropped. What I would love is some
sort of instrumentation that could let us see the highest number of
concurrent pre-auth connections the current running instance of the
daemon has seen, so we can graph it and alert on it proactively (e.g.
when we get within some reasonable percentage of the actual limit), and
then decide if we need to increase MaxStartups further, scale our fleet
horizontally, or do something else.
I'm more than happy to write & contribute the code to do this
instrumentation, but I'd like to get some guidance on
direction/implementation options first, so I don't spend time writing
code which is never going to be accepted.
The most trivial approach would be to add logging to the main daemon,
either when we get within X% of MaxStartups (X being possibly
configurable), or just log the current max value every X minutes or Y
connections (perhaps at Verbose logging level?). Either would be
functional, but both feel a little bit unwieldy.
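To make the logging option concrete, here is a minimal sketch of the bookkeeping it implies, written in Python as a toy model and not as sshd code. The class name `StartupTracker`, its methods, and the `warn_pct` parameter are all hypothetical. It keeps a high-water mark of concurrent pre-auth connections and logs only when a new peak lands at or above the warning threshold, which keeps log volume low.

```python
import logging

log = logging.getLogger("sshd-model")  # hypothetical logger name


class StartupTracker:
    """Toy model of pre-auth connection accounting (not actual sshd code)."""

    def __init__(self, max_startups, warn_pct=80):
        self.max_startups = max_startups
        # Warn once the peak reaches this many concurrent pre-auth connections.
        self.warn_at = max_startups * warn_pct / 100
        self.current = 0
        self.peak = 0

    def started(self):
        """Call when a new unauthenticated connection is accepted."""
        self.current += 1
        if self.current > self.peak:
            self.peak = self.current
            # Log only on a new peak at or above the threshold.
            if self.peak >= self.warn_at:
                log.warning("pre-auth connections peaked at %d of %d",
                            self.peak, self.max_startups)

    def finished(self):
        """Call when a connection authenticates or is closed."""
        if self.current > 0:
            self.current -= 1
```

The peak resets only when the process restarts, which matches the "current running instance of the daemon" scope described above.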
Alternatively, we could go a more complex and flexible route such as the
way haproxy does it, with a local unix socket that responds to a 'stats'
command with some simple text format. This would be more generally
usable and extensible to other metrics in future, and seems more robust
to me, although it would be noticeably more work than just logging.
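As a rough illustration of the socket option, the sketch below models a local Unix socket that answers a 'stats' command with simple "name: value" text lines. It is a Python toy and not a proposed sshd implementation; the function names, the counter names, and the reply format are assumptions made for the example.

```python
import os
import socketserver


def render_stats(stats):
    """Render counters as 'name: value' lines, sorted by name."""
    return "".join(f"{name}: {value}\n" for name, value in sorted(stats.items()))


def handle_command(command, stats):
    """Return the text reply for one command line."""
    if command.strip() == "stats":
        return render_stats(stats)
    return "error: unknown command\n"


def make_server(path, stats):
    """Create a Unix stream server bound to `path`.

    The caller is expected to run serve_forever() on the result.
    """
    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            # One command per connection: read a line, reply, close.
            line = self.rfile.readline().decode("ascii", "replace")
            self.wfile.write(handle_command(line, stats).encode("ascii"))

    if os.path.exists(path):
        os.unlink(path)  # remove a stale socket left by a previous run
    server = socketserver.UnixStreamServer(path, Handler)
    # Restrict access to the owner; a real daemon would want to set
    # permissions without the small window between bind and chmod.
    os.chmod(path, 0o600)
    return server
```

With a server from `make_server` running, a monitoring agent could poll it with something like `echo stats | socat - UNIX-CONNECT:/path/to/socket` and parse the lines into whatever system it uses, which keeps the daemon itself agnostic about the monitoring stack.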
Is either of these approaches in keeping with current design
preferences? I'm open to any (other) approach; once the info is exposed
in *some* fashion, anyone can get it into their monitoring system of
choice via various hooks, and I think being agnostic about the actual
monitoring system is the right choice (e.g. a Prometheus HTTP endpoint
exporter embedded in OpenSSH would be very, very wrong).
Thanks,
Craig Miskell
SRE, GitLab
[1] ~26M/day, ~300/s avg
More information about the openssh-unix-dev
mailing list