Nagle & delayed ACK strike again
Chris Rapier
rapier at psc.edu
Fri Dec 22 12:51:49 EST 2006
I'm assuming that the network in question has a 1500B MTU. Does anything
change if the MTU is increased to 9k?
Miklos Szeredi wrote:
>>> To me it still looks like the use of Nagle is the exception; it has
>>> already been turned off in the server for
>>>
>>> - interactive sessions
>> For at least some interactive sessions. In the telnet space at least,
>> there is a constant back and forth between wanting keystrokes to be
>> nice and uniform, and not overwhelming slow terminal devices (e.g.
>> barcode scanners) when applications on the server dump a bunch of
>> stuff down stdio.
>
> For ssh this is unconditional. I've suggested adding NoDelay/
> NoNoDelay options, but somebody on this list vetoed that.
>
>>> - X11 forwarding
>>>
>>> and it will need to be turned off for
>>>
>>> - SFTP transport
>>>
>>> - IP tunnelling
>>>
>>> - ???
>>>
>>> Is there any transported protocol where Nagle does make sense?
>> Regular FTP is one, anything unidirectional.
>
> Nagle doesn't help FTP or HTTP, does it? Anything that just pushes a
> big chunk of data will automatically end up with big packets.
>
> So other than the disputed interactive session, Nagle doesn't seem to
> have any positive effects.
>
>> It also depends on what one is trying to optimize. If one is only
>> interested in optimizing time, Nagle may not be the thing. However,
>> Nagle can optimize the ratio of data to data+headers and it can optimize
>> the quantity of CPU consumed per unit of data transferred.
>
> For a filesystem protocol obviously latency (and hence throughput) is
> the most important factor.
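To put a rough number on the data-to-(data+headers) point above: 512-byte
sends that go out as individual segments versus the same data coalesced into
full-size segments. A back-of-the-envelope sketch, assuming a 1500-byte MTU
and 40 bytes of IPv4+TCP header per segment (no TCP options):

#include <stdio.h>

/* Wire efficiency of 512-byte segments vs. full-MSS segments, assuming
 * a 1500-byte MTU and 40 bytes of IPv4+TCP header per segment. */
int main(void)
{
    const double hdr = 40.0;
    const double mss = 1500.0 - hdr;    /* 1460-byte payload per segment */
    const double small = 512.0;

    printf("512-byte segments: %.1f%% payload\n",
           100.0 * small / (small + hdr));    /* ~92.8% */
    printf("full segments:     %.1f%% payload\n",
           100.0 * mss / (mss + hdr));        /* ~97.3% */
    return 0;
}

Fewer, larger segments also mean fewer per-packet trips through the stack,
which is where the CPU-per-byte savings mentioned above come from.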
>
>> Some netperf data for the unidirectional case, between a system in Palo
>> Alto and one in Cupertino, sending-side CPU utilization included
>> (similar things can happen to receive-side CPU):
>>
>> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
>> tardy.cup.hp.com (15.244.56.217) port 0 AF_INET
>> Recv Send Send Utilization Service Demand
>> Socket Socket Message Elapsed Send Recv Send Recv
>> Size Size Size Time Throughput local remote local remote
>> bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB
>>
>> 131072 219136 512 10.10 74.59 8.78 -1.00 9.648 -1.000
>>
>> raj@tardy:~/netperf2_work$ src/netperf -H tardy.cup.hp.com -c -- -m 512
>> -s 128K -S 128K -D
>> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
>> tardy.cup.hp.com (15.244.56.217) port 0 AF_INET : nodelay
>> Recv Send Send Utilization Service Demand
>> Socket Socket Message Elapsed Send Recv Send Recv
>> Size Size Size Time Throughput local remote local remote
>> bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB
>>
>> 131072 219136 512 10.02 69.21 20.56 -1.00 24.335 -1.000
>>
>> The multiple concurrent request/response case is more nuanced and
>> difficult to make. Basically, it is a race between how many small
>> requests (or responses) will be made at one time, the RTT between the
>> systems, the standalone ACK timer on the receiver, and the service time
>> on the receiver.
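As an aside on the standalone ACK timer: on Linux a receiver can ask for
immediate ACKs per-socket with TCP_QUICKACK. It is Linux-specific and not
sticky (the kernel may quietly go back to delaying ACKs), so it is more of a
diagnostic knob than a fix, but for completeness a minimal sketch (the helper
name is mine):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Ask the Linux stack to ACK incoming data immediately instead of
 * running the delayed-ACK timer.  Not sticky: the kernel may revert,
 * so callers typically re-arm it after each receive. */
static int quickack(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}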
>>
>> Here is some data with netperf TCP_RR between those two systems:
>>
>> raj@tardy:~/netperf2_work$ src/netperf -H tardy.cup.hp.com -c -t TCP_RR
>> -- -r 128,2048 -b 3
>> TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
>> tardy.cup.hp.com (15.244.56.217) port 0 AF_INET : first burst 3
>> Local /Remote
>> Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem
>> Send Recv Size Size Time Rate local remote local remote
>> bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr
>>
>> 16384 87380 128 2048 10.00 1106.42 4.74 -1.00 42.852 -1.000
>> 32768 32768
>> raj@tardy:~/netperf2_work$ src/netperf -H tardy.cup.hp.com -c -t TCP_RR
>> -- -r 128,2048 -b 3 -D
>> TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
>> tardy.cup.hp.com (15.244.56.217) port 0 AF_INET : nodelay : first burst 3
>> Local /Remote
>> Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem
>> Send Recv Size Size Time Rate local remote local remote
>> bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr
>>
>> 16384 87380 128 2048 10.01 2145.98 10.49 -1.00 48.875 -1.000
>> 32768 32768
>>
>>
>> Now, setting TCP_NODELAY did indeed produce a big jump in transactions
>> per second. Notice though how it also resulted in a 14% increase in CPU
>> utilization per transaction (48.875 vs 42.852 us/Tr of service demand).
>> Clearly the lunch was not free.
>>
>> The percentage difference in transactions per second shrinks as the
>> number of outstanding transactions grows. Taking the settings from
>> above, where the first column is the size of the burst in netperf, the
>> second is without TCP_NODELAY set, the third with:
>>
>> raj@tardy:~/netperf2_work$ for i in 3 6 9 12 15 18 21 24 27; do echo $i
>> `src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P 0 -v 0 -- -r 128,2048
>> -b $i; src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P 0 -v 0 -- -r
>> 128,2048 -b $i -D`; done
>> 3 1186.40 2218.63
>> 6 1952.53 3695.64
>> 9 2574.49 4833.47
>> 12 3194.71 4856.63
>> 15 3388.54 4784.26
>> 18 4215.70 5099.52
>> 21 4645.97 5170.89
>> 24 4918.16 5336.79
>> 27 4927.71 5448.78
>>
>> If we increase the request size to 256 bytes, and the response to 8192
>> (In all honesty I don't know what sizes sftp might use so I'm making
>> wild guesses) we can see the convergence happen much sooner - it takes
>> fewer of the 8192-byte responses to fill the bandwidth-delay product of
>> the link:
>>
>> raj@tardy:~/netperf2_work$ for i in 3 6 9 12 15 18 21 24 27; do echo $i
>> `src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P 0 -v 0 -- -r 256,8192
>> -b $i -s 128K -S 128K; src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P
>> 0 -v 0 -- -r 256,8192 -s 128K -S 128K -b $i -D`; done
>> 3 895.18 1279.38
>> 6 1309.11 1405.38
>> 9 1395.30 1325.44
>> 12 1256.75 1422.01
>> 15 1412.39 1413.64
>> 18 1400.04 1419.76
>> 21 1415.62 1422.79
>> 24 1419.56 1420.10
>> 27 1422.43 1379.72
>
> In SFTP the WRITE request/reply sizes are more like 64kB/32B, and the
> outstanding transactions are as many as the socket buffers will bear.
>
> The slowdown is clearly due to 50ms outages from delayed ACK, which is
> totally broken: the network is just sitting there idle for no good
> reason whatsoever.
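For what it's worth, those 50ms gaps look like the textbook Nagle/delayed-ACK
interaction: when the sender runs out of buffered data the last segment is
usually smaller than the MSS, Nagle holds it back until the data already in
flight is ACKed, and the receiver is sitting on exactly that ACK waiting for
a second full segment, so both ends idle until the delayed-ACK timer (tens of
milliseconds on Linux) fires. A quick sketch of the arithmetic for a 64kB
WRITE, assuming a 1448-byte MSS (1500-byte MTU with IPv4, TCP and timestamp
options) and ignoring SSH/SFTP framing:

#include <stdio.h>

/* A 64 KiB application write split into TCP segments with a 1448-byte
 * MSS.  The sub-MSS tail is the segment Nagle holds back while the
 * receiver's delayed-ACK timer holds back the ACK it is waiting for. */
int main(void)
{
    const int write_size = 64 * 1024;
    const int mss = 1448;

    printf("%d full segments + one %d-byte tail\n",
           write_size / mss, write_size % mss);    /* 45 + 376 bytes */
    return 0;
}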
>
> I can make new traces, but I guess they would be very similar to the
> ones I sent last time for the SFTP download case.
>
> Miklos
> _______________________________________________
> openssh-unix-dev mailing list
> openssh-unix-dev at mindrot.org
> http://lists.mindrot.org/mailman/listinfo/openssh-unix-dev