HPN Patch for OpenSSH 4.2p1 Available
Darren Tucker
dtucker at zip.com.au
Mon Oct 10 20:07:44 EST 2005
Chris Rapier wrote:
> The performance hit is initially somewhat surprising, but on reflection
> I think this is mostly going to depend on how the systems' TCP
> stacks are tuned. I can see performance decreasing on a LAN if the
> receive buffer is too high. An easy test would be to run some iperfs and
> see if the system TCP receive buffer shows a similar performance hit
> versus setting the iperf window to 64k. I'll run some similar tests
> to see if I'm right.
I think that what's happening is that on the source side, the buffers
are growing larger than necessary and ssh is spending
disproportionately more time managing them. This would explain why I
saw more impact on faster CPUs: they can fill the buffers faster and
thus are affected more.
Going back to my earlier data:
        -current      -hpn11
real    2m40.966s     2m57.296s  (10.2% slower)
user    0m33.187s     0m45.820s
sys     0m31.500s     0m37.539s
Note that there is a significant increase in user time (roughly 38%
more CPU consumed in userland). If the difference were purely due to
stack tuning, I'd expect user time to be similar.
I'm also going to profile ssh, but my bet is that extra time is in the
buffer functions.
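For reference, the profiling step will be something like this (a sketch,
assuming the tree is rebuilt with -pg):

  $ ./configure CFLAGS=-pg LDFLAGS=-pg && make   # profiling build
  $ ./ssh -ocompression=no -o ciphers=arcfour localhost \
      'cat > /dev/null' < /tmp/tmp               # exercise the buffer path
  $ gprof ./ssh gmon.out | head -30              # top of the flat profile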
> However, this should only be an issue with non-autotuning kernels.
> Autotuning kernels such as Linux 2.4.26+, 2.6+, and Windows Vista (nee
> Longhorn) will adjust the receive buffer to maximize throughput. Since
> HPN-SSH is autotuning-aware this shouldn't be a problem on those
> systems. On non-autotuning kernels appropriate use of the -w option
> should resolve this.
The Linux kernel used for the test was 2.6.12-1.1376_FC3.
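For completeness, the iperf comparison suggested above would look
something like this (iperf 2.x options; "otherhost" is a placeholder):

  otherhost$ iperf -s -w 64k            # receiver with a fixed 64k window
  thishost$  iperf -c otherhost -w 64k
  # then repeat both ends without -w (system default) and compare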
> Again, I'll need to test this but I'm pretty sure that's the issue.
>
> Darren Tucker wrote:
>> Chris Rapier wrote:
>>
>>> As a note, we now have HPN patch for OpenSSH 4.2 at
>>> http://www.psc.edu/networking/projects/hpn-ssh/
>>>
>>> It's still part of the last set of patches (HPN11) so there aren't any
>>> additional changes in the code. It patches, configures, compiles, and
>>> passes make tests without a problem. I've not done extensive testing
>>> of this version of OpenSSH but I don't foresee any problems.
>>>
>>> I did run a couple of tests between two patched 4.2 installs (one in
>>> Switzerland, the other in Pennsylvania, USA) and hit around 12MB/s
>>> with the HPN patch and 500KB/s with the standard install. So it still
>>> seems to work as expected.
>>
>>
>> Have you done any analysis of the impact of the patch on
>> low-latency/low-BDP links (i.e. LANs)?
>>
>> I've been doing some work with parts of the HPN patch. For scp'ing to
>> and from localhost and LANs (see below), the scp changes on their own
>> (actually, a variant thereof) show a modest improvement of 2-3% in
>> throughput. That's good.
>>
>> For the entire patch (hpn11), however, it shows a significant
>> *decrease* in throughput for the same tests: 10% slower on OpenBSD to
>> localhost, 12% slower on Linux to localhost and 18% slower from Linux
>> to OpenBSD via a 100Mb/s LAN. That's bad. I would imagine LANs are
>> more common than the high-BDP networks that your patch targets :-)
>
> I'll check this of course. We don't have any OpenBSD systems, but
> I'll try to find a spare box we can blow it onto.
Thanks. I'm most interested in the Linux results at this point, with
and without the stack tuning.
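By "stack tuning" I mean varying the limits the autotuner works within,
i.e. something like the following (values purely illustrative; each
triple is min, default and max, in bytes):

  # as root: raise the receive/send buffer ceilings
  $ sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
  $ sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"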
>> I suspect that the buffer size increases you do are counterproductive
>> for such networks. I also suspect that you could get the same
>> performance improvements on high-BDP links as you currently do by
>> simply increasing the channel advertisements without the SSH buffer
>> changes, relying on the TCP socket buffer for output and on decrypting
>> quickly for input, but I'm not able to test that.
>
> Of course it is possible to increase performance through other means.
> Transferring data in parallel, for example, is a great way to do this. In
> fact the suggestions you made are ones we were planning on implementing
> in addition to the buffer hack, especially now that we really are CPU
> limited as opposed to being buffer limited.
>
> However, that seems, in my view at least, to be an overly complicated
> way of going about it. The buffer hack is pretty straightforward and
> well known - the concept is laid out in Stevens' TCP/IP Illustrated
> Volume 1, after all.
I'll have to dig out my copy and read that bit :-)
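(The relevant bit being the bandwidth-delay product: to keep a pipe full
you need roughly bandwidth * RTT of buffering in flight. A 100Mb/s path
with a 100ms RTT needs about 12.5MB/s * 0.1s = 1.25MB; the same link at
LAN latencies of ~1ms needs only about 12.5KB.)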
>> Test method, with a 64MB file:
>> $ time for i in 1 2 3 4 5 6 7 8 9 0; do scp -ocompression=no \
>>     -o ciphers=arcfour /tmp/tmp localhost:/dev/null; done
>
> I'll try this out and let you know what I find. Could you let me know
> what you had your TCP receive buffer set to when you tried these tests?
> Optimally for these local tests it should be set to 64k.
On OpenBSD: the default (16k send and recv).
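That is, what these sysctls report (the output shown is just the
expected defaults):

  $ sysctl net.inet.tcp.sendspace net.inet.tcp.recvspace
  net.inet.tcp.sendspace=16384
  net.inet.tcp.recvspace=16384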
On Linux, whatever this decodes to (per line: min, default and max, in
bytes):
$ sysctl -a |egrep 'tcp.*mem'
net.ipv4.tcp_rmem = 4096 87380 174760
net.ipv4.tcp_wmem = 4096 16384 131072
net.ipv4.tcp_mem = 98304 131072 196608
> By the way - I just ran the above test and didn't see any sort of
> substantive difference. There was a slight edge to the HPN platform
> *but* that was probably due to the scp hack.
>
> Of course, in this case the problem is likely to be disk speed limits
> (I'm on a PowerBook at the moment and its disks are slow). Bypassing scp
> and doing a direct memory-to-memory copy is probably the right methodology.
The source files for the Athlon were from RAM (/tmp mounted on mfs).
The others were from local disk (reiserfs for the Linux system, ufs
w/softdep for the OpenBSD one).
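A direct memory-to-memory run along the lines you suggest could be as
simple as (64MB of zeros, same cipher as the tests above):

  $ time dd if=/dev/zero bs=64k count=1024 2>/dev/null | \
      ssh -ocompression=no -o ciphers=arcfour localhost 'cat > /dev/null'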
--
Darren Tucker (dtucker at zip.com.au)
GPG key 8FF4FA69 / D9A3 86E9 7EEE AF4B B2D4 37C9 C982 80C7 8FF4 FA69
Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.