HPN Patch for OpenSSH 4.2p1 Available
Chris Rapier
rapier at psc.edu
Tue Oct 11 01:51:10 EST 2005
Here is what my test looked like. I used /dev/zero as a data source and
/dev/null as a sink to avoid any disk I/O issues. Compression was
disabled to avoid any data compression on the stream, and the cipher was
RC4 (arcfour).
A standard 4.1 server was running on port 22228 and a 4.1HPN server was
running on port 22229. The machine is a dual Xeon 2GHz box with 1GB of
RAM, running Linux 2.4.29, with the receive buffer set to 4MB.
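For reference, the 4MB receive buffer was set through the usual Linux
sysctls; something along these lines (from memory, so the exact values
may be slightly off):

sysctl -w net.core.rmem_max=4194304
sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"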
The command used was:

time for i in 1 2 3 4 5 6 7 8 9 0; do \
  head -c 100000000 /dev/zero | \
  ./ssh -p 2222[8|9] -o compression=no -o ciphers=arcfour localhost \
    "cat - > /dev/null"; \
done
4.1p1 -> 4.1p1
real 0m22.300s
user 0m18.160s
sys 0m4.790s
4.1 -> 4.1HPN
real 0m21.001s
user 0m16.590s
sys 0m4.790s
4.1HPN -> 4.1
real 0m21.982s
user 0m17.380s
sys 0m4.950s
4.1HPN -> 4.1HPN
real 0m20.557s
user 0m16.380s
sys 0m4.770s
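(For scale: each loop pushes 10 x 100MB, i.e. about 1GB, through ssh, so
the 4.1p1 -> 4.1p1 run works out to roughly 1000MB / 22.3s, or about
45MB/s.)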
Those numbers are about what I'd expect from a localhost connection.
There doesn't seem to be any statistically significant variation. Of
course, it's only a run of ten connections, so I'll be rerunning it with
a couple hundred connections and see what that looks like. I might also
time each connection individually and get some SD values, just for
curiosity's sake.
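A rough sketch of how I'd do the per-connection timing (untested, assumes
GNU time is available; the port and iteration count are placeholders):

rm -f times.txt
for i in $(seq 1 200); do
    /usr/bin/time -f "%e" -a -o times.txt sh -c \
      'head -c 100000000 /dev/zero |
       ./ssh -p 22229 -o compression=no -o ciphers=arcfour localhost "cat - > /dev/null"'
done
# population mean/SD of the per-connection wall-clock times
awk '{ s += $1; ss += $1 * $1 }
     END { m = s / NR; printf "mean %.3fs  sd %.3fs\n", m, sqrt(ss / NR - m * m) }' times.txt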
Now, here is the same test running between two machines connected via
GigE on the same LAN (different subnets, though, with an average RTT of
0.859ms). The source machine, I believe, is a single-CPU Xeon 2.4GHz with
1GB of RAM, running Linux 2.6.13. The sink is the machine referenced
above.
4.1p1 -> 4.1p1
real 0m29.528s
user 0m17.703s
sys 0m4.276s
4.1 -> 4.1HPN
real 0m22.874s
user 0m17.202s
sys 0m4.553s
4.1HPN -> 4.1
real 0m28.942s
user 0m17.370s
sys 0m4.134s
4.1HPN -> 4.1HPN
real 0m22.315s
user 0m16.621s
sys 0m4.614s
So where you saw a performance decrease, I'm seeing an improvement of
roughly 22%. I'm going to rerun all of these tests with more iterations,
but right now I'm, unfortunately, not seeing the same problems you are.
Unfortunate because it just makes this a more complicated question :\
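(For the record, the 22% is just from the real times above: (29.528 -
22.874) / 29.528 is about 22.5% comparing 4.1p1 -> 4.1p1 against 4.1 ->
4.1HPN; stock-to-stock against HPN-to-HPN works out to about 24%.)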
My gut feeling is that the tests, while somewhat different in their
details, are roughly equivalent in method. However, maybe there is a
problem with my methodology. Let me know if you see anything.
Chris
Darren Tucker wrote:
> Chris Rapier wrote:
>
>>The performance hit is, initially, somewhat surprising, but on reflection
>>I think this is mostly going to be dependent on how the systems' TCP
>>stacks are tuned. I can see performance decreasing in the LAN if the
>>receive buffer is too high. An easy test would be to run some iPerfs and
>>see if the system TCP receive buffer shows a similar performance hit
>>versus setting the iPerf window to 64k. I'll run some similar tests
>>to see if I'm right.
>
>
> I think that what's happening is that on the source side, the buffers
> are growing larger than is necessary and ssh is spending
> disproportionately more time managing them. This would explain why I
> saw more impact for faster cpus: they can fill the buffers faster and
> thus are affected more.
>
> Going back to my earlier data:
>              -current      -hpn11
> real    2m40.966s    2m57.296s   (10.2% slower)
> user    0m33.187s    0m45.820s
> sys     0m31.500s    0m37.539s
>
> Note that there is a significant increase in user time. If the
> difference was purely due to stack tuning, I'd expect user time to be
> similar.
>
> I'm also going to profile ssh, but my bet is that extra time is in the
> buffer functions.
>
>
>>However, this should only be an issue with non-autotuning kernels.
>>Autotuning kernels such as Linux 2.4.26+, 2.6+, and Windows Vista (née
>>Longhorn) will adjust the receive buffer to maximize throughput. Since
>>HPN-SSH is auto-tuning aware this shouldn't be a problem on those
>>systems. On non-autotuning kernels appropriate use of the -w option
>>should resolve this.
>
>
> The Linux kernel used for the test was 2.6.12-1.1376_FC3.
>
>
>>Again, I'll need to test this but I'm pretty sure thats the issue.
>>
>>Darren Tucker wrote:
>>
>>>Chris Rapier wrote:
>>>
>>>
>>>>As a note, we now have HPN patch for OpenSSH 4.2 at
>>>>http://www.psc.edu/networking/projects/hpn-ssh/
>>>>
>>>>It's still part of the last set of patches (HPN11) so there aren't any
>>>>additional changes in the code. It patches, configures, compiles, and
>>>>passes make tests without a problem. I've not done extensive testing
>>>>for this version of openssh but I don't foresee any problems.
>>>>
>>>>I did run a couple of tests between two patched 4.2 installs (one in
>>>>Switzerland, the other in Pennsylvania, USA) and hit around 12MB/s
>>>>with the hpn patch and 500KB/s with the standard install. So it still
>>>>seems to work as expected.
>>>
>>>
>>>Have you done any analysis of the impact of the patch on
>>>low-latency/BDP links (ie LANs)?
>>>
>>>I've been doing some work with parts of the HPN patch. For scp'ing to
>>>and from localhost and LANs (see below), the scp changes on their own
>>>(actually, a variant thereof) shows a modest improvement of 2-3% in
>>>throughput. That's good.
>>>
>>>For the entire patch (hpn11), however, it shows a significant
>>>*decrease* in throughput for the same tests: 10% slower on OpenBSD to
>>>localhost, 12% slower on Linux to localhost, and 18% slower from Linux to
>>>OpenBSD via a 100Mb/s LAN. That's bad. I would imagine LANs are
>>>more common than the high-BDP networks that your patch targets :-)
>>
>>I'll check this, of course. We don't have any OpenBSD systems, but
>>I'll try to find a spare box we can blow it onto.
>
>
> Thanks. I'm most interested in the Linux results at this point, with
> and without the stack tuning.
>
>
>>>I suspect that the buffer size increases you do are counterproductive
>>>for such networks. I also suspect that you could get the same
>>>performance improvements on high-BDP links as you currently do by
>>>simply increasing the channel advertisements without the SSH buffer
>>>changes and relying on the TCP socket buffer for output and decrypting
>>>quickly for input, but I'm not able to test that.
>>
>>Of course it is possible to increase performance through other means.
>>Transferring data in parallel, for example, is a great way to do this. In
>>fact, the suggestions you made are ones we were planning on implementing
>>in addition to the buffer hack, especially now that we really are CPU
>>limited as opposed to buffer limited.
>>
>>However, that seems, in my view at least, to be an overly complicated
>>way of going about it. The buffer hack is pretty straightforward and
>>well known - the concept is laid out in Stevens' TCP/IP Illustrated,
>>Volume 1, after all.
>
>
> I'll have to dig out my copy and read that bit :-)
>
>
>>>Test method, with a 64MB file:
>>>$ time for i in 1 2 3 4 5 6 7 8 9 0; do scp -ocompression=no -o
>>>ciphers=arcfour /tmp/tmp localhost:/dev/null; done
>>
>>I'll try this out and let you know what I find. Could you let me know
>>what you had your tcp receive buffer set to when you tried these tests?
>>Optimally for these local tests it should be set to 64k.
>
>
> On OpenBSD: the default (16k send and recv).
>
> On Linux, whatever this decodes to:
> $ sysctl -a |egrep 'tcp.*mem'
> net.ipv4.tcp_rmem = 4096 87380 174760
> net.ipv4.tcp_wmem = 4096 16384 131072
> net.ipv4.tcp_mem = 98304 131072 196608
>
>
>>By the way - I just ran the above test and didn't see any sort of
>>substantive difference. There was a slight edge to the HPN platform
>>*but* that was probably due to the scp hack.
>>
>>Of course, in this case the problem is likely to be disk speed limits
>>(I'm on a PowerBook at the moment and its disks are slow). Bypassing scp
>>and doing a direct memory-to-memory copy is probably the right methodology.
>
>
> The source files for the Athlon were from RAM (/tmp mounted on mfs).
>
> The others were from local disk (reiserfs for Linux system, ufs
> w/softdep for the OpenBSD).
>