HP-UX slow login problem found?

Kevin Steves kevin at atomicgears.com
Sat Jul 13 05:35:41 EST 2002


./Configure hpux-parisc2-cc

will pull in asm/pa-risc2.o

I'll copy Chris (author of that code) in case he has
any thoughts.

On Fri, Jul 12, 2002 at 03:54:29AM -0400, Deron Meranda wrote:
> I think I finally figured out the problem that many people have been
> having with extremely long login times under HP-UX 11.x.  The problem
> is really in OpenSSL, and in particular the Diffie-Hellman parameter
> generation routines under the PA-RISC processor.  I suspect this may
> not be a problem with the IA64 (Itanium) processors. This especially
> shows up if you use the gcc compiler.  Fortunately I have access to
> Rational Quantify, a very powerful profiler which led me down to just
> a few lines of assembly code causing almost the whole delay.
> 
> I finally have an ssh/sshd executable under HP which logs in almost
> instantaneously.  I wouldn't consider this a complete solution yet,
> especially if you don't have access to HP's ANSI C compiler, and I
> haven't thoroughly tested this whole configuration.  But this
> information may still prove quite useful.
> 
> I'm using the latest of everything...
> 
>   OpenSSH 3.4p1
>   OpenSSL 0.9.7 Beta 2
>   libz 1.1.4
>   gcc 3.1 (using gas from binutils 2.12.1)
>   HP ANSI C compiler (version B.11.01.06)
> 
> Although this is a 64-bit OS, I'm compiling everything in 32-bit mode.
> 
> I'm running on a 9000/L2000-44 under HP-UX 11.0.  This is a
> two-processor 440MHz PA-RISC 2.0 system.  If you only have a PA-RISC
> 1.x processor I think you may still be out of luck??  You can check
> your processor version by running the command "getconf CPU_VERSION".
> If it returns 532 or higher you have a 2.0 processor.
> 
> There are basically two extremely slow routines in OpenSSL which show
> up if you compile it "out of the box": RSA operations and DH parameter
> generation.  You can test how fast these are with the following...
> 
>   $ openssl speed rsa   # tests all RSA operations
>   $ openssl dhparam -text 128   # generates DH parameters (128-bit)
> 
> The RSA test is pretty accurate--you can compare this with other
> systems like Linux on a PC.  The DH test is unfortunately very
> random: some runs will be quick and others slow.  You'll have to run
> it many times and with different bit sizes to gauge how slow it is.
> Again, comparing to a Linux box may be useful.  You will almost
> definitely see the HP version being much slower than Linux/Intel (on
> Pentium3/Athlon).  This is because in practice the Intel chips seem to
> have much faster integer performance; whereas the PA-RISC is much
> faster with floating point.  Unfortunately for you, most crypto is
> integer based.  Just to give you a comparison point, here are my numbers
> (after optimizing it as described below)...
> 
>                      sign    verify    sign/s verify/s
>    rsa  512 bits   0.0023s   0.0002s    432.2   5402.4
>    rsa 1024 bits   0.0094s   0.0005s    106.8   2132.8
>    rsa 2048 bits   0.0519s   0.0014s     19.3    690.2
>    rsa 4096 bits   0.3258s   0.0049s      3.1    203.4
> 
> Without my changes, even with gcc -O3, my speeds were about 100 times
> slower!  The DH speed is much harder to measure, but it was definitely
> real slow with the gcc compiled version.
> 
> Okay, what's going on inside the OpenSSL code....  there are two small
> functions which are responsible for about 95% of the CPU clock cycles.
> These are bn_mul_add_words() in the file crypto/bn/bn_asm.c and the
> function BN_mod_word() in the file crypto/bn/bn_word.c.  The first is
> responsible for the miserable RSA speeds, and the latter for the
> horrible DH speeds.  I'll discuss how to speed each of these up
> separately.
> 
> The bn_mul_add_words() function is by default implemented in the file
> bn_asm.c.  However, neither gcc nor the HP C compiler seems to be able
> to optimize that implementation very well.  As that function can be
> called thousands if not millions of times, every last clock cycle is
> extremely important.  Fortunately there is some hand-crafted assembly
> code in an alternate implementation.  It can be found in the OpenSSL
> distribution in the file crypto/bn/asm/pa-risc2.s.  You need to use
> that file instead of the generic bn_asm.c file.  However, there are
> some restrictions...that file only works with HP's assembler (not
> gas), only on PA-RISC 2.0 systems, and it is not relocatable/PIC
> (can't be used in a shared library).
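
For readers who have not looked at this routine, here is a minimal portable C sketch of the operation that bn_mul_add_words() is described as performing: multiply each word of one array by a single word and add it into another array, propagating the carry. This is not OpenSSL's source. The name mul_add_words_sketch and the choice of 32-bit words with a 64-bit intermediate are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch only (hypothetical name, not OpenSSL code).
 * Computes rp[i] += ap[i] * w for i in [0, num) and returns the
 * carry word left over after the last position.
 */
uint32_t mul_add_words_sketch(uint32_t *rp, const uint32_t *ap,
                              int num, uint32_t w)
{
    uint64_t carry = 0;
    int i;

    assert(num >= 0);
    for (i = 0; i < num; i++) {
        /*
         * 32x32 -> 64-bit multiply plus the existing word and the
         * incoming carry.  The largest possible value is
         * (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so this cannot overflow.
         */
        uint64_t t = (uint64_t)ap[i] * w + rp[i] + carry;

        rp[i] = (uint32_t)t;   /* low half stays in place */
        carry = t >> 32;       /* high half moves to the next word */
    }
    return (uint32_t)carry;
}
```

The loop body is only a handful of instructions, which is why the quality of the generated code for the multiply and carry handling dominates the run time when the function is called very many times.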
> 
> I haven't completely figured out OpenSSL's non-standard configure
> scripts.  But it is easy enough to just assemble it yourself and then
> replace that object in the libcrypto.a library.
> 
>    ar d libcrypto.a bn_asm.o
>    ar r libcrypto.a pa-risc2.o
>    ranlib libcrypto.a
> 
> Then relink the openssl executable.  Rerun your RSA speed
> test; hopefully the results will be very pleasant.
> 
> 
> Now, for the Diffie-Hellman part (the primary reason for SSH
> slowness).  There is no assembly version of the bn_word.c file.  And
> unfortunately gcc's optimizer, even with gcc 3.1 and with -O3 and
> -march=2.0, is pretty poor.  This basically is because gcc invokes
> some millicode routines to do the 64-bit modulus "%" operation.  I've
> found though that HP's ANSI C compiler with the correct optimization
> arguments is able to produce some PA-RISC 2.0 specific instructions
> which make it very fast in comparison (say by 100 clock cycles).
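
To show where the 64-bit "%" comes from, here is a minimal C sketch of a word-by-word remainder in the style of what BN_mod_word() is described as doing. It is not OpenSSL's source. The name mod_word_sketch, the 32-bit word size, and the least-significant-word-first layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch only (hypothetical name, not OpenSSL code).
 * Returns the remainder of a multi-word number divided by w.
 * d[0] is the least significant 32-bit word; top is the word count.
 */
uint32_t mod_word_sketch(const uint32_t *d, int top, uint32_t w)
{
    uint64_t rem = 0;
    int i;

    assert(w != 0);
    for (i = top - 1; i >= 0; i--) {
        /*
         * rem < w <= 2^32-1, so (rem << 32) | d[i] fits in 64 bits.
         * This 64-bit "%" runs once per word; it is the operation
         * that, per the message above, gcc hands to a millicode
         * routine on PA-RISC.
         */
        rem = ((rem << 32) | d[i]) % w;
    }
    return (uint32_t)rem;
}
```

Because the remainder is taken once per word for every candidate tested against a small divisor, the cost of that single "%" is multiplied many times over during DH parameter generation.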
> 
>    cc +O3 +ESlit +DA2.0 +DS2.0 -Ae \
>       -DOPENSSL_THREADS -D_REENTRANT -DDSO_DL -DOPENSSL_NO_KRB5 \
>       -I/opt/gnu/include \
>       -DOPENSSL_NO_RC5 -DOPENSSL_NO_IDEA -D_REENTRANT \
>       -DB_ENDIAN -DMD32_XARRAY -c bn_word.c -o bn_word.o
> 
> Also throw in +Z if you're trying to make a shared library (but see
> note about pa-risc2.s file above).
> 
> Except for those two files (pa-risc2.s and bn_word.c), you can use gcc
> for everything else.  I've been using gcc 3.1, with -O3 -march=2.0.
> 
> Now, if all goes well, you'll have a new libcrypto.a.  Compile and
> link OpenSSH against that one and you should see fast logins, finally!
> Note, both the server (sshd) and the client (ssh) need to be
> recompiled/relinked, as both generate their half of the DH parameters.
> 
> Deron Meranda
> _______________________________________________
> openssh-unix-dev at mindrot.org mailing list
> http://www.mindrot.org/mailman/listinfo/openssh-unix-dev


