HP-UX slow login problem found?

Deron Meranda dmeranda at iac.net
Fri Jul 12 17:54:29 EST 2002


I think I finally figured out the problem that many people have been
having with extremely long login times under HP-UX 11.x.  The problem
is really in OpenSSL, in particular in the Diffie-Hellman parameter
generation routines on PA-RISC processors, and it shows up especially
if you use the gcc compiler.  I suspect it may not be a problem on
IA64 (Itanium) processors.  Fortunately I have access to
Rational Quantify, a very powerful profiler which led me down to just
a few lines of assembly code causing almost the whole delay.

I finally have an ssh/sshd executable under HP which logs in almost
instantaneously.  I wouldn't consider this a complete solution yet,
especially if you don't have access to HP's ANSI C compiler, and I
haven't thoroughly tested this whole configuration.  But this
information may still prove quite useful.

I'm using the latest of everything...

  OpenSSH 3.4p1
  OpenSSL 0.9.7 Beta 2
  libz 1.1.4
  gcc 3.1 (using gas from binutils 2.12.1)
  HP ANSI C compiler (version B.11.01.06)

Although this is a 64-bit OS, I'm compiling everything in 32-bit mode.

I'm running on a 9000/L2000-44 under HP-UX 11.0.  This is a
two-processor 440MHz PA-RISC 2.0 system.  If you only have a PA-RISC
1.x processor I think you may still be out of luck.  You can check
your processor version by running the command "getconf CPU_VERSION".
If it returns 532 or higher you have a 2.0 processor.

There are basically two extremely slow routines in OpenSSL which show
up if you compile it "out of the box": RSA operations and DH parameter
generation.  You can test how fast these are with the following...

  $ openssl speed rsa   # tests all RSA operations
  $ openssl dhparam -text 128   # generates DH parameters (128-bit)

The RSA test is pretty accurate--you can compare this with other
systems like Linux on a PC.  The DH test is unfortunately very
random: some runs will be quick and others slow.  You'll have to run
it many times and with different bit sizes to gauge how slow it is.
Again, comparing to a Linux box may be useful.  You will almost
definitely see the HP version being much slower than Linux/Intel (on
Pentium3/Athlon).  This is because in practice the Intel chips seem to
have much faster integer performance, whereas the PA-RISC is much
faster with floating point.  Unfortunately for you, most crypto is
integer based.  Just to give you a comparison point, here are my
numbers (after optimizing it as described below)...

                     sign    verify    sign/s verify/s
   rsa  512 bits   0.0023s   0.0002s    432.2   5402.4
   rsa 1024 bits   0.0094s   0.0005s    106.8   2132.8
   rsa 2048 bits   0.0519s   0.0014s     19.3    690.2
   rsa 4096 bits   0.3258s   0.0049s      3.1    203.4

Without my changes, even with gcc -O3, my speeds were about 100 times
slower!  The DH speed is much harder to measure, but it was definitely
real slow with the gcc compiled version.

Okay, what's going on inside the OpenSSL code?  There are two small
functions which are responsible for about 95% of the CPU clock cycles.
These are bn_mul_add_words() in the file crypto/bn/bn_asm.c and the
function BN_mod_word() in the file crypto/bn/bn_word.c.  The first is
responsible for the miserable RSA speeds, and the latter for the
horrible DH speeds.  I'll discuss how to speed each of these up
separately.

The bn_mul_add_words() function is by default implemented in the file
bn_asm.c.  However, neither gcc nor the HP C compiler seems to be able
to optimize that implementation very well.  As that function can be
called thousands if not millions of times, every last clock cycle is
extremely important.  Fortunately there is some hand-crafted assembly
code in an alternate implementation.  It can be found in the OpenSSL
distribution in the file crypto/bn/asm/pa-risc2.s.  You need to use
that file instead of the generic bn_asm.c file.  However, there are
some restrictions: that file only works with HP's assembler (not
gas), only on PA-RISC 2.0 systems, and it is not relocatable/PIC
(can't be used in a shared library).
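
To make it concrete why every cycle in this routine matters, here is a
minimal sketch of the kind of multiply-accumulate loop involved.  It is
NOT the code from bn_asm.c or pa-risc2.s; the function name, the 32-bit
word size and the argument layout are assumptions chosen for
illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified sketch (not the bn_asm.c source): computes
 * rp[i] += ap[i] * w for i in [0, num) and returns the final carry.
 * The name and the 32-bit word size are assumptions for illustration. */
uint32_t mul_add_words_sketch(uint32_t *rp, const uint32_t *ap,
                              int num, uint32_t w)
{
    uint64_t carry = 0;
    int i;

    assert(num >= 0);
    for (i = 0; i < num; i++) {
        /* 32x32 -> 64-bit product plus the existing word and carry.
         * This cannot overflow 64 bits:
         * (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
        uint64_t t = (uint64_t)ap[i] * w + rp[i] + carry;

        rp[i] = (uint32_t)t;   /* low word is stored in place */
        carry = t >> 32;       /* high word carries into the next step */
    }
    return (uint32_t)carry;
}
```

The loop body is only a few operations, but it runs once per word for
every big-number multiplication, so a handful of wasted cycles per
iteration adds up quickly.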

I haven't completely figured out OpenSSL's non-standard configure
scripts.  But it is easy enough to just assemble pa-risc2.s yourself
and then replace the bn_asm.o object in the libcrypto.a library with
the result.

   ar d libcrypto.a bn_asm.o
   ar r libcrypto.a pa-risc2.o
   ranlib libcrypto.a

Then relink the openssl executable.  Rerun your RSA speed test;
hopefully the results should be very pleasant.


Now, for the Diffie-Hellman part (the primary reason for SSH
slowness).  There is no assembly version of the bn_word.c file.  And
unfortunately gcc's optimizer, even with gcc 3.1 and with -O3 and
-march=2.0, is pretty poor here.  This is basically because gcc
invokes some millicode routines to do the 64-bit modulus "%"
operation.  I've found though that HP's ANSI C compiler with the
correct optimization arguments is able to produce some PA-RISC 2.0
specific instructions which make it very fast in comparison (say by
100 clock cycles).

   cc +O3 +ESlit +DA2.0 +DS2.0 -Ae \
      -DOPENSSL_THREADS -D_REENTRANT -DDSO_DL -DOPENSSL_NO_KRB5 \
      -I/opt/gnu/include \
      -DOPENSSL_NO_RC5 -DOPENSSL_NO_IDEA -D_REENTRANT \
      -DB_ENDIAN -DMD32_XARRAY -c bn_word.c -o bn_word.o

Also throw in +Z if you're trying to make a shared library (but see
note about pa-risc2.s file above).
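
For readers who want to see where the 64-bit "%" comes from, the
following is a minimal sketch of a word-by-word remainder loop of the
kind BN_mod_word() performs.  It is a simplified illustration, not the
bn_word.c source: the function name, the 32-bit word size and the
least-significant-word-first layout are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified sketch (not the bn_word.c source): remainder of a
 * multi-word number modulo a single 32-bit word.  d[0] is assumed to
 * be the least significant word and top the number of words. */
uint32_t mod_word_sketch(const uint32_t *d, int top, uint32_t w)
{
    uint64_t rem = 0;
    int i;

    assert(w != 0);
    /* Walk from the most significant word down, long-division style. */
    for (i = top - 1; i >= 0; i--) {
        /* rem < w < 2^32, so (rem << 32) | d[i] fits in 64 bits.
         * This 64-bit '%' is the operation that can end up as a
         * helper-routine call instead of inline instructions when
         * building for a 32-bit target. */
        rem = ((rem << 32) | d[i]) % w;
    }
    return (uint32_t)rem;
}
```

There is one 64-bit modulus per word of the big number, and the
function is called over and over during parameter generation, so the
cost of each individual "%" dominates the run time.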

Except for those two files (pa-risc2.s and bn_word.c), you can use gcc
for everything else.  I've been using gcc 3.1, with -O3 -march=2.0.

Now, if all goes well, you'll have a new libcrypto.a.  Compile and
link OpenSSH against that one and you should see fast logins, finally!
Note, both the server (sshd) and the client (ssh) need to be
recompiled/relinked, as both generate their half of the DH parameters.

Deron Meranda


