Nagle & delayed ACK strike again
Rick Jones
rick.jones2 at hp.com
Fri Dec 22 09:42:37 EST 2006
Miklos Szeredi wrote:
>>>>My personal stance is that 99 times out of ten, if an end-user
>>>>application speeds-up when it sets TCP_NODELAY, it implies the end-user
>>>>application is broken and sending "logically associated" data in
>>>>separate send calls.
>>>
>>>
>>>You tell me, is X protocol broken?
>>
>>Likely not - the discrete mouse events which are usually cited as the
>>reason X needs TCP_NODELAY are not logically associated. Hence that is
>>the 100th situation out of 10 rather than the 99.
>>
>>
>>>Is SFTP broken?
>>
>>Depends - is it writing logically associated data to the connection in
>>more than one send call?
>
>
> No. They are logically separate calls.
>
>
>>>I don't think
>>>SFTP is more broken than any other network fs protocol. The slowdown
>>>happens with a stream of WRITE requests and replies. If the requests
>>>weren't acknowledged, there wouldn't be any trouble, but
>>>acknowledgements do make sense for synchronous operation.
>>
>>Do you have some system call traces and/or packet traces we could look
>>at? If the write requests and replies are each a single send call, they
>>probably qualify as the "X exception".
>
>
> Yes this is the case. It's the symmetric counterpart of a READ
> message pair, where the request is small and the reply is large. In
> that case the client needed TCP_NODELAY to solve the delayed ACK
> interaction problem.
>
> With the WRITE it is the opposite: the request is large and the reply is
> small, and now TCP_NODELAY is needed on the server.
>
> In both cases the request and the reply are sent to the socket with a
> single write() call.
>
> To me it still looks like the use of Nagle is the exception; it has
> already been turned off in the server for
>
> - interactive sessions
For at least some interactive sessions. In the telnet space at least,
there is this constant back and forth between wanting keystrokes to be
nice and uniform, and not overwhelming slow terminal devices (e.g.
barcode scanners) when applications on the server dump a bunch of stuff
down stdio.
> - X11 forwarding
>
> and it will need to be turned off for
>
> - SFTP transport
>
> - IP tunnelling
>
> - ???
>
> Is there any transported protocol where Nagle does make sense?
Regular FTP is one; so is anything unidirectional.
It also depends on what one is trying to optimize. If one is only
interested in optimizing time, Nagle may not be the thing. However,
Nagle can optimize the ratio of data to data+headers, and it can optimize
the quantity of CPU consumed per unit of data transferred.
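To make the two alternatives being argued over concrete, here is a
minimal, generic C sketch (not anything lifted from sshd or sftp; "fd"
is assumed to be an already-connected TCP socket). An application can
either switch Nagle off for the socket, or keep Nagle and hand
"logically associated" pieces to the stack in one call so there is never
a trailing runt queued behind earlier pieces:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Option 1: disable Nagle on this socket. */
static int disable_nagle(int fd)
{
        int one = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* Option 2: keep Nagle, but send a request's header and payload in one
 * call instead of two, so the stack sees them as one logical send. */
static ssize_t send_request(int fd, const void *hdr, size_t hdrlen,
                            const void *payload, size_t paylen)
{
        struct iovec iov[2];

        iov[0].iov_base = (void *)hdr;      /* cast away const for iovec */
        iov[0].iov_len  = hdrlen;
        iov[1].iov_base = (void *)payload;
        iov[1].iov_len  = paylen;

        /* One call, one logical send: at most the final sub-MSS piece
         * can be held back waiting for an ACK. */
        return writev(fd, iov, 2);
}

Buffering in the application and issuing a single write() accomplishes
the same thing as the writev() above.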
Some netperf data for the unidirectional case, between a system in Palo
Alto and one in Cupertino, with sending-side CPU utilization included;
similar things can happen to receive-side CPU:
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
tardy.cup.hp.com (15.244.56.217) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

131072 219136    512    10.10        74.59   8.78     -1.00    9.648   -1.000
raj at tardy:~/netperf2_work$ src/netperf -H tardy.cup.hp.com -c -- -m 512
-s 128K -S 128K -D
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
tardy.cup.hp.com (15.244.56.217) port 0 AF_INET : nodelay
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

131072 219136    512    10.02        69.21   20.56    -1.00    24.335  -1.000
The case for Nagle with multiple concurrent requests/responses is more
nuanced and difficult to make. Basically, it is a race between how many
small requests (or responses) will be made at one time, the RTT between
the systems, the standalone ACK timer on the receiver, and the service
time on the receiver.
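To put made-up but plausible numbers on that race: with the 128-byte
requests used in the TCP_RR tests below, a burst of three is only 384
bytes, well under a typical 1460-byte MSS, so with Nagle on the later
requests of a burst queue behind the first until its data is
acknowledged - either quickly, piggybacked on the response, or only once
the receiver's standalone ACK timer (commonly tens to a couple hundred
milliseconds) fires. With TCP_NODELAY the whole burst goes out
immediately.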
Here is some data with netperf TCP_RR between those two systems:
raj at tardy:~/netperf2_work$ src/netperf -H tardy.cup.hp.com -c -t TCP_RR
-- -r 128,2048 -b 3
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
tardy.cup.hp.com (15.244.56.217) port 0 AF_INET : first burst 3
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % U    us/Tr   us/Tr

16384  87380  128      2048    10.00    1106.42  4.74   -1.00  42.852  -1.000
32768  32768
raj at tardy:~/netperf2_work$ src/netperf -H tardy.cup.hp.com -c -t TCP_RR
-- -r 128,2048 -b 3 -D
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
tardy.cup.hp.com (15.244.56.217) port 0 AF_INET : nodelay : first burst 3
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % U    us/Tr   us/Tr

16384  87380  128      2048    10.01    2145.98  10.49  -1.00  48.875  -1.000
32768  32768
Now, setting TCP_NODELAY did indeed produce a big jump in transactions
per second. Notice though how it also resulted in a 14% increase in CPU
utilization per transaction. Clearly the lunch was not free.
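(For reference, that 14% comes from the local service demand column:
(48.875 - 42.852) / 42.852 is just over 14% more CPU consumed per
transaction.)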
The percentage difference in transactions per second shrinks as the
number of outstanding transactions grows. Taking the settings from
above, where the first column is the size of the burst in netperf, the
second is the transaction rate without TCP_NODELAY set, and the third is
with it:
raj at tardy:~/netperf2_work$ for i in 3 6 9 12 15 18 21 24 27; do echo $i
`src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P 0 -v 0 -- -r 128,2048
-b $i; src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P 0 -v 0 -- -r
128,2048 -b $i -D`; done
3 1186.40 2218.63
6 1952.53 3695.64
9 2574.49 4833.47
12 3194.71 4856.63
15 3388.54 4784.26
18 4215.70 5099.52
21 4645.97 5170.89
24 4918.16 5336.79
27 4927.71 5448.78
If we increase the request size to 256 bytes and the response to 8192
(in all honesty I don't know what sizes sftp might use, so I'm making
wild guesses), we can see the convergence happen much sooner - it takes
fewer of the 8192-byte responses to fill the bandwidth-delay product of
the link (a rough back-of-envelope on that follows the numbers below):
raj at tardy:~/netperf2_work$ for i in 3 6 9 12 15 18 21 24 27; do echo $i
`src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P 0 -v 0 -- -r 256,8192
-b $i -s 128K -S 128K; src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P
0 -v 0 -- -r 256,8192 -s 128K -S 128K -b $i -D`; done
3 895.18 1279.38
6 1309.11 1405.38
9 1395.30 1325.44
12 1256.75 1422.01
15 1412.39 1413.64
18 1400.04 1419.76
21 1415.62 1422.79
24 1419.56 1420.10
27 1422.43 1379.72
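As a rough back-of-envelope (the link rate and RTT here are illustrative
guesses, not measurements): the bandwidth-delay product is rate times
RTT, so a 100 Mbit/s path with, say, 3 ms of round trip holds
100e6 * 0.003 / 8, or roughly 37 KB - only four or five 8192-byte
responses in flight, which squares with the two columns above converging
by a burst size of six or so.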
rick jones