Nagle & delayed ACK strike again
Miklos Szeredi
miklos at szeredi.hu
Fri Dec 22 10:14:40 EST 2006
> > To me it still looks like the use of Nagle is the exception, it has
> > already been turned off in the server for
> >
> > - interactive sessions
>
> For at least some interactive sessions. In the telnet space at least,
> there is this constant back and forth happening between wanting
> keystrokes to be nice and uniform, and not overwhelming slow terminal
> devices (e.g. barcode scanners) when applications on the server dump a
> bunch of stuff down stdio.
For ssh, disabling Nagle is unconditional.  I've suggested adding
NoDelay/NoNoDelay options, but somebody on this list vetoed that.
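
For reference, all such a knob would really do is toggle the
TCP_NODELAY socket option on the session's TCP connection.  A minimal
sketch (the function name is mine, this is not actual OpenSSH code):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <stdio.h>

/* Toggle Nagle on a connected TCP socket.
 * enable != 0 sets TCP_NODELAY (Nagle off); enable == 0 clears it
 * (Nagle back on). */
static int set_nodelay(int fd, int enable)
{
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                   &enable, sizeof(enable)) == -1) {
        perror("setsockopt(TCP_NODELAY)");
        return -1;
    }
    return 0;
}

A NoDelay option would call this with 1, a NoNoDelay option with 0.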
> > - X11 forwarding
> >
> > and it will need to be turned off for
> >
> > - SFTP transport
> >
> > - IP tunnelling
> >
> > - ???
> >
> > Is there any transported protocol where Nagle does make sense?
>
> Regular FTP is one, anything unidirectional.
Nagle doesn't help FTP or HTTP, does it?  Anything that just pushes a
big chunk of data will automatically end up with big packets.
So other than the disputed interactive session, Nagle doesn't seem to
have any positive effects.
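
And for protocols that do generate lots of small pieces, the usual fix
is to coalesce in the application rather than rely on Nagle, e.g. with
writev().  A rough sketch, with made-up names, not actual ssh/sftp
code:

#include <sys/uio.h>
#include <unistd.h>

/* Send a small protocol header and its payload in one syscall, so the
 * kernel can emit one large segment instead of hoping Nagle glues two
 * small write()s together.  Caller must still handle short writes. */
static ssize_t send_packet(int fd, const void *hdr, size_t hdrlen,
                           const void *payload, size_t paylen)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,     .iov_len = hdrlen },
        { .iov_base = (void *)payload, .iov_len = paylen },
    };
    return writev(fd, iov, 2);
}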
> It also depends on what one is trying to optimize. If one is only
> interested in optimizing time, Nagle may not be the thing. However,
> Nagle can optimize the ratio of data to data+headers and it can optimize
> the quantity of CPU consumed per unit of data transferred.
For a filesystem protocol, latency (and hence throughput) is obviously
the most important factor.
> Some netperf data for the unidirectional case, between a system in Palo
> Alto and one in Cupertino, sending-side CPU utilization included,
> similar things can happen to receive-side CPU:
>
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> tardy.cup.hp.com (15.244.56.217) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
> 131072 219136    512    10.10       74.59    8.78     -1.00    9.648   -1.000
>
> raj at tardy:~/netperf2_work$ src/netperf -H tardy.cup.hp.com -c -- -m 512
> -s 128K -S 128K -D
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> tardy.cup.hp.com (15.244.56.217) port 0 AF_INET : nodelay
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
>
> 131072 219136    512    10.02       69.21   20.56     -1.00   24.335   -1.000
>
> The multiple concurrent request/response case is more nuanced and
> difficult to make. Basically, it is a race between how many small
> requests (or responses) will be made at one time, the RTT between the
> systems, the standalone ACK timer on the receiver, and the service time
> on the receiver.
>
> Here is some data with netperf TCP_RR between those two systems:
>
> raj at tardy:~/netperf2_work$ src/netperf -H tardy.cup.hp.com -c -t TCP_RR
> -- -r 128,2048 -b 3
> TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> tardy.cup.hp.com (15.244.56.217) port 0 AF_INET : first burst 3
> Local /Remote
> Socket Size   Request Resp.   Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size    Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes   secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     2048    10.00   1106.42  4.74   -1.00  42.852  -1.000
> 32768  32768
> raj at tardy:~/netperf2_work$ src/netperf -H tardy.cup.hp.com -c -t TCP_RR
> -- -r 128,2048 -b 3 -D
> TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> tardy.cup.hp.com (15.244.56.217) port 0 AF_INET : nodelay : first burst 3
> Local /Remote
> Socket Size   Request Resp.   Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size    Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes   secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     2048    10.01   2145.98  10.49  -1.00  48.875  -1.000
> 32768  32768
>
>
> Now, setting TCP_NODELAY did indeed produce a big jump in transactions
> per second. Notice though how it also resulted in a 14% increase in CPU
> utilization per transaction. Clearly the lunch was not free.
>
> The percentage difference in transactions per second shrinks as the
> number of outstanding transactions grows. Taking the settings from
> above, where the first column is the size of the burst in netperf, the
> second is without TCP_NODELAY set, the third with:
>
> raj at tardy:~/netperf2_work$ for i in 3 6 9 12 15 18 21 24 27; do echo $i
> `src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P 0 -v 0 -- -r 128,2048
> -b $i; src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P 0 -v 0 -- -r
> 128,2048 -b $i -D`; done
> 3 1186.40 2218.63
> 6 1952.53 3695.64
> 9 2574.49 4833.47
> 12 3194.71 4856.63
> 15 3388.54 4784.26
> 18 4215.70 5099.52
> 21 4645.97 5170.89
> 24 4918.16 5336.79
> 27 4927.71 5448.78
>
> If we increase the request size to 256 bytes, and the response to 8192
> (In all honesty I don't know what sizes sftp might use so I'm making
> wild guesses) we can see the convergence happen much sooner - it takes
> fewer of the 8192 byte responses to take the TCP connection to the
> bandwidth delay product of the link:
>
> raj at tardy:~/netperf2_work$ for i in 3 6 9 12 15 18 21 24 27; do echo $i
> `src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P 0 -v 0 -- -r 256,8192
> -b $i -s 128K -S 128K; src/netperf -H tardy.cup.hp.com -t TCP_RR -l 4 -P
> 0 -v 0 -- -r 256,8192 -s 128K -S 128K -b $i -D`; done
> 3 895.18 1279.38
> 6 1309.11 1405.38
> 9 1395.30 1325.44
> 12 1256.75 1422.01
> 15 1412.39 1413.64
> 18 1400.04 1419.76
> 21 1415.62 1422.79
> 24 1419.56 1420.10
> 27 1422.43 1379.72
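
Just to put a very rough number on that bandwidth delay product (both
figures below are my guesses, not measurements from the runs above):
with a ~100 Mbit/s path and an RTT around 2ms, the pipe only holds
about three of those 8192-byte responses, so it doesn't take many
outstanding transactions to fill it:

#include <stdio.h>

int main(void)
{
    /* Assumed, not measured: link speed and round-trip time. */
    const double bits_per_sec = 100e6;  /* ~100 Mbit/s */
    const double rtt_sec      = 0.002;  /* ~2 ms */

    double bdp_bytes = bits_per_sec * rtt_sec / 8.0;
    printf("BDP ~ %.0f bytes, ~%.1f responses of 8192 bytes\n",
           bdp_bytes, bdp_bytes / 8192.0);  /* ~25000 bytes, ~3 responses */
    return 0;
}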
In SFTP the WRITE request/reply sizes are more like 64kB/32B, and the
outstanding transactions are as many as the socket buffers will bear.
The slowdown is clearly due to 50ms outages from delayed ACK, which is
totally broken: the network just sits there idle for no good reason
whatsoever.
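
To put a number on how bad that is: if there's, say, 128kB of WRITEs
outstanding (an assumed socket buffer size, just for illustration) and
the transfer then sits through a ~50ms delayed-ACK stall before the
next batch can go out, throughput is capped at roughly
128kB / 50ms = 2.5MB/s no matter how fast the link is.  Besides
turning off Nagle on the sending side, Linux also lets a receiver opt
out of delaying ACKs with TCP_QUICKACK; the flag isn't sticky, so it
has to be re-armed around reads.  A sketch, not actual sftp code:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Linux-specific: ask the kernel to ACK incoming data immediately
 * instead of delaying.  The kernel may clear the flag again, so re-arm
 * it after each read. */
static ssize_t read_quickack(int fd, void *buf, size_t len)
{
    int one = 1;
    ssize_t n = read(fd, buf, len);
    (void)setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
    return n;
}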
I can make new traces, but I guess they would be very similar to the
ones I sent last time for the SFTP download case.
Miklos