Nagle & delayed ACK strike again
Rick Jones
rick.jones2 at hp.com
Fri Dec 22 09:42:37 EST 2006
Miklos Szeredi wrote:
>>>>My personal stance is that 99 times out of ten, if an end-user
>>>>application speeds-up when it sets TCP_NODELAY, it implies the end-user
>>>>application is broken and sending "logically associated" data in
>>>>separate send calls.
>>>
>>>
>>>You tell me, is X protocol broken?
>>
>>Likely not - the discrete mouse events which are usually cited as the
>>reason X needs TCP_NODELAY are not logically associated. Hence that is
>>the 100th situation out of 10 rather than the 99.
>>
>>
>>>Is SFTP broken?
>>
>>Depends - is it writing logically associated data to the connection in
>>more than one send call?
>
>
> No. They are logically separate calls.
>
>
>>>I don't think
>>>SFTP is more broken than any other network fs protocol. The slowdown
>>>happens with a stream of WRITE requests and replies. If the requests
>>>weren't acknowledged, there wouldn't be any trouble, but
>>>acknowledgements do make sense for synchronous operation.
>>
>>Do you have some system call traces and/or packet traces we could look
>>at? If the write requests and replies are each a single send call, they
>>probably qualify as the "X exception".
>
>
> Yes this is the case. It's the symmetric counterpart of a READ
> message pair, where the request is small and the reply is large. In
> that case the client needed TCP_NODELAY to solve the delayed ACK
> interaction problem.
>
> With the WRITE it is the opposite: the request is large and the reply is
> small, and now TCP_NODELAY is needed on the server.
>
> In both cases the request and the reply are sent to the socket with a
> single write() call.
>
> To me it still looks like the use of Nagle is the exception; it has
> already been turned off in the server for
>
> - interactive sessions
For at least some interactive sessions. In the telnet space at least,
there is this constant back and forth between wanting keystrokes to be
nice and uniform, and not overwhelming slow terminal devices (e.g.
barcode scanners) when applications on the server dump a bunch of stuff
down stdio.
> - X11 forwarding
>
> and it will need to be turned off for
>
> - SFTP transport
>
> - IP tunnelling
>
> - ???
>
> Is there any transported protocol where Nagle does make sense?
Regular FTP is one; so is anything unidirectional.
It also depends on what one is trying to optimize. If one is only
interested in optimizing time, Nagle may not be the thing. However,
Nagle can optimize the ratio of data to data+headers, and it can optimize
the quantity of CPU consumed per unit of data transferred.
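To make the two alternatives being argued over concrete, here is a
minimal, generic C sketch (not anything lifted from sshd or sftp; "fd"
is assumed to be an already-connected TCP socket). An application can
either switch Nagle off for the socket, or keep Nagle and hand
"logically associated" pieces to the stack in one call so there is never
a trailing runt queued behind earlier pieces:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Option 1: disable Nagle on this socket. */
static int disable_nagle(int fd)
{
        int one = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* Option 2: keep Nagle, but send a request's header and payload in one
 * call instead of two, so the stack sees them as one logical send. */
static ssize_t send_request(int fd, const void *hdr, size_t hdrlen,
                            const void *payload, size_t paylen)
{
        struct iovec iov[2];

        iov[0].iov_base = (void *)hdr;      /* cast away const for iovec */
        iov[0].iov_len  = hdrlen;
        iov[1].iov_base = (void *)payload;
        iov[1].iov_len  = paylen;

        /* One call, one logical send: at most the final sub-MSS piece
         * can be held back waiting for an ACK. */
        return writev(fd, iov, 2);
}

Buffering in the application and issuing a single write() accomplishes
the same thing as the writev() above.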
Some netperf data for the unidirectional case, between a system in Palo
Alto and one in Cupertino, with sending-side CPU utilization included;
similar things can happen to receive-side CPU:
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
tardy.cup.hp.com (15.244.56.217) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

131072 219136    512    10.10        74.59   8.78     -1.00    9.648   -1.000
raj at tardy:~/netperf2_work$ src/netperf -H tardy.cup.hp.com -c -- -m 512
-s 128K -S 128K -D
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
tardy.cup.hp.com (15.244.56.217) port 0 AF_INET : nodelay
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

131072 219136    512    10.02        69.21   20.56    -1.00    24.335  -1.000
The case for Nagle with multiple concurrent requests/responses is more
nuanced and difficult to make. Basically, it is a race between how many
small requests (or responses) will be made at one time, the RTT between
the systems, the standalone ACK timer on the receiver, and the service
time on the receiver.
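To put made-up but plausible numbers on that race: with the 128-byte
requests used in the TCP_RR tests below, a burst of three is only 384
bytes, well under a typical 1460-byte MSS, so with Nagle on the later
requests of a burst queue behind the first until its data is
acknowledged - either quickly, piggybacked on the response, or only once
the receiver's standalone ACK timer (commonly tens to a couple hundred
milliseconds) fires. With TCP_NODELAY the whole burst goes out
immediately.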
Here is some data with netperf TCP_RR between those two systems:
raj at tardy:~/netperf2_work$ src/netperf -H tardy.cup.hp.com -c -t TCP_RR
-- -r 128,2048 -b 3
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
tardy.cup.hp.com (15.244.56.217) port 0 AF_INET : first burst 3
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % U    us/Tr   us/Tr

16384  87380  128      2048    10.00    1106.42  4.74   -1.00  42.852  -1.000
32768  32768
raj at tardy:~/netperf2_work$ src/netperf -H tardy.cup.hp.com -c -t TCP_RR
-- -r 128,2048 -b 3 -D
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
tardy.cup.hp.com (15.244.56.217) port 0 AF_INET : nodelay : first burst 3
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % U    us/Tr   us/Tr

16384  87380  128      2048    10.01    2145.98  10.49  -1.00  48.875  -1.000
32768  32768
Now, setting TCP_NODELAY did indeed produce a big jump in transactions
per second. Notice though how it also resulted in a 14% increase in CPU
utilization per transaction. Clearly the lunch was not free.
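(For reference, that 14% comes from the local service demand column:
(48.875 - 42.852) / 42.852 is just over 14% more CPU consumed per
transaction.)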
The percentage difference in transactions per second shrinks as the
number of outstanding transactions grows. Taking the settings from
above, where the first column is the size of the burst in netperf, the
second is the transaction rate without TCP_NODELAY set, and the third is
with it:
raj at tardy:~/netperf2_work$ for i in 3 6 9 12 15 18 21 24 27; do echo $i
`src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P 0 -v 0 -- -r 128,2048
-b $i; src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P 0 -v 0 -- -r
128,2048 -b $i -D`; done
3 1186.40 2218.63
6 1952.53 3695.64
9 2574.49 4833.47
12 3194.71 4856.63
15 3388.54 4784.26
18 4215.70 5099.52
21 4645.97 5170.89
24 4918.16 5336.79
27 4927.71 5448.78
If we increase the request size to 256 bytes and the response to 8192
(in all honesty I don't know what sizes sftp might use, so I'm making
wild guesses), we can see the convergence happen much sooner - it takes
fewer of the 8192-byte responses to fill the bandwidth-delay product of
the link (a rough back-of-envelope on that follows the numbers below):
raj at tardy:~/netperf2_work$ for i in 3 6 9 12 15 18 21 24 27; do echo $i
`src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P 0 -v 0 -- -r 256,8192
-b $i -s 128K -S 128K; src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P
0 -v 0 -- -r 256,8192 -s 128K -S 128K -b $i -D`; done
3 895.18 1279.38
6 1309.11 1405.38
9 1395.30 1325.44
12 1256.75 1422.01
15 1412.39 1413.64
18 1400.04 1419.76
21 1415.62 1422.79
24 1419.56 1420.10
27 1422.43 1379.72
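As a rough back-of-envelope (the link rate and RTT here are illustrative
guesses, not measurements): the bandwidth-delay product is rate times
RTT, so a 100 Mbit/s path with, say, 3 ms of round trip holds
100e6 * 0.003 / 8, or roughly 37 KB - only four or five 8192-byte
responses in flight, which squares with the two columns above converging
by a burst size of six or so.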
rick jones