[netflow-tools] Softflowd and rrd running on the same machine

Bogdan Ghita b.ghita at jack.see.plymouth.ac.uk
Tue May 3 01:02:20 EST 2005


Hello Damien

Thank you very much for your response. Although I've still got a few
unclear issues, at least I've cleared up some of them now and have
plenty to try over the next few days :). See a few more comments below.

Best regards
Bogdan

> -----Original Message-----
> From: Damien Miller [mailto:djm at mindrot.org]
> Sent: 30 April 2005 00:12
> To: Bogdan Ghita
> Cc: netflow-tools at mindrot.org
> Subject: Re: [netflow-tools] Softflowd and rrd running on the same machine
> 
> Bogdan Ghita wrote:
> > Hello everybody
> >
> > I've installed softflowd and the rrd-associated tools
> > (flow-capture/flowscan) a few days ago on a monitoring machine
> > connected to a local network. First, I'll start with the praise - it's
> > great software; I've been looking at netflow-related software for a
> > long time, but couldn't find anything that would let me implement it
> > via a monitoring machine. Everything seems to be working reasonably
> > well at the moment, but I still have a couple of problems that I can't
> > find the solution for:
> >
> > - out-of-order packets. In order to get softflowd and rrd to work, I'm
> > sending packets via the local interface of the machine. I thought this
> > would work just fine, but flow-capture continually reports 'lost'
> > packets:
> >
> > Apr 29 11:27:18 linux flow-capture[23080]: ftpdu_seq_check():
> > src_ip=xxx.xxx.xxx.xxx dst_ip=xxx.xxx.xxx.xxx d_version=5
> > expecting=28199800 received=28199770 lost=4294967265
> > Apr 29 11:27:18 linux flow-capture[23080]: ftpdu_seq_check():
> > src_ip=xxx.xxx.xxx.xxx dst_ip=xxx.xxx.xxx.xxx d_version=5
> > expecting=28199800 received=28199770 lost=4294967265
> 
> This might be a bug in the sequence number generation of softflowd, or a
> bug in flow-capture - flow-capture is certainly printing negative
> sequence number offsets incorrectly.
> 
> Can you capture some softflowd output packets with another netflow
> capable tool to manually check the sequence numbers? E.g. tcpdump's
> cnfp mode, flowd, etc.


Thank you for the tcpdump tip - although I use it quite often, I wasn't
aware that it can decode Netflow packets. I tried it and got the
following trace:

15:00:51.234336 IP (tos 0x0, ttl  64, id 37603, offset 0, flags [DF],
length: 1444) tester.32782 > tester.2000: NetFlow v5, 261557.484 uptime,
1115042451.234285000, #92222382, 29 recs
15:00:51.235204 IP (tos 0x0, ttl  64, id 37604, offset 0, flags [DF],
length: 1444) tester.32782 > tester.2000: NetFlow v5, 261557.484 uptime,
1115042451.234285000, #92222382, 29 recs
[lines deleted...]
15:00:51.277266 IP (tos 0x0, ttl  64, id 37656, offset 0, flags [DF],
length: 1492) tester.32782 > tester.2000: NetFlow v5, 261557.484 uptime,
1115042451.234285000, #92222382, 30 recs
15:00:51.277933 IP (tos 0x0, ttl  64, id 37657, offset 0, flags [DF],
length: 1492) tester.32782 > tester.2000: NetFlow v5, 261557.484 uptime,
1115042451.234285000, #92222382, 30 recs
15:01:01.242400 IP (tos 0x0, ttl  64, id 37718, offset 0, flags [DF],
length: 1444) tester.32782 > tester.2000: NetFlow v5, 261567.493 uptime,
1115042461.242348000, #92227192, 29 recs
15:01:01.243251 IP (tos 0x0, ttl  64, id 37719, offset 0, flags [DF],
length: 1444) tester.32782 > tester.2000: NetFlow v5, 261567.493 uptime,
1115042461.242348000, #92227192, 29 recs
[lines deleted...]
15:01:01.322057 IP (tos 0x0, ttl  64, id 37835, offset 0, flags [DF],
length: 1492) tester.32782 > tester.2000: NetFlow v5, 261567.493 uptime,
1115042461.242348000, #92227192, 30 recs
15:01:01.322110 IP (tos 0x0, ttl  64, id 37836, offset 0, flags [DF],
length: 1444) tester.32782 > tester.2000: NetFlow v5, 261567.493 uptime,
1115042461.242348000, #92227192, 29 recs
15:01:01.322157 IP (tos 0x0, ttl  64, id 37837, offset 0, flags [DF],
length: 772) tester.32782 > tester.2000: NetFlow v5, 261567.493 uptime,
1115042461.242348000, #92227192, 15 recs
15:01:11.230365 IP (tos 0x0, ttl  64, id 37838, offset 0, flags [DF],
length: 1444) tester.32782 > tester.2000: NetFlow v5, 261577.480 uptime,
1115042471.230315000, #92231968, 29 recs
15:01:11.231270 IP (tos 0x0, ttl  64, id 37839, offset 0, flags [DF],
length: 1444) tester.32782 > tester.2000: NetFlow v5, 261577.480 uptime,
1115042471.230315000, #92231968, 29 recs
15:01:11.232049 IP (tos 0x0, ttl  64, id 37840, offset 0, flags [DF],
length: 1444) tester.32782 > tester.2000: NetFlow v5, 261577.480 uptime,
1115042471.230315000, #92231968, 29 recs
[lines deleted...]
15:01:11.276308 IP (tos 0x0, ttl  64, id 37952, offset 0, flags [DF],
length: 1444) tester.32782 > tester.2000: NetFlow v5, 261577.480 uptime,
1115042471.230315000, #92231968, 29 recs
15:01:11.276335 IP (tos 0x0, ttl  64, id 37953, offset 0, flags [DF],
length: 1492) tester.32782 > tester.2000: NetFlow v5, 261577.480 uptime,
1115042471.230315000, #92231968, 30 recs
15:01:11.276353 IP (tos 0x0, ttl  64, id 37954, offset 0, flags [DF],
length: 340) tester.32782 > tester.2000: NetFlow v5, 261577.480 uptime,
1115042471.230315000, #92231968,  6 recs
15:01:21.225781 IP (tos 0x0, ttl  64, id 37955, offset 0, flags [DF],
length: 1444) tester.32782 > tester.2000: NetFlow v5, 261587.476 uptime,
1115042481.225734000, #92236610, 29 recs
15:01:21.226594 IP (tos 0x0, ttl  64, id 37956, offset 0, flags [DF],
length: 1444) tester.32782 > tester.2000: NetFlow v5, 261587.476 uptime,
1115042481.225734000, #92236610, 29 recs
15:01:21.227292 IP (tos 0x0, ttl  64, id 37957, offset 0, flags [DF],
length: 1444) tester.32782 > tester.2000: NetFlow v5, 261587.476 uptime,
1115042481.225734000, #92236610, 29 recs

I've looked at print-cnfp.c within tcpdump, and the # values are the
Netflow sequence numbers. I am not sure why they remain the same between
datagrams (are they supposed to? the specification says they should be
incremented by the number of flows in each new datagram), but one thing
is sure - there is no gap in the IP id numbers, so no datagrams are lost
in the process. Related to this, I've changed the expiry interval to 10
seconds, hoping to get more granularity (and less loss, due to reduced
bursts) - has anybody obtained better/worse results while varying the
expiry interval?


> 
> > Is it possible (I've done very little in terms of socket programming,
> > so the answer might be straightforward) to pipe softflowd straight
> > into flow-capture, so that the communication will be (hopefully)
> > smoother?
> 
> It is highly unlikely that the packets are actually getting dropped. It
> is more likely a bug in one/both packages.


Judging by the trace above (no gaps in the id numbers), I am more
tempted to believe there is (at least) some sort of misinterpretation,
if not a bug.

> 
> > - CPU usage - softflowd seems to behave very strangely when changing
> > the priority - with the default priority (0), it continuously uses
> > about 80% of the processor (yes, it is a big network and, again, yes,
> > it is a slow machine - P3-900MHz); when I renice it (via 'top') to a
> > lower priority (e.g. 10), the utilisation drops to about 20%, although
> > the processor is, overall, only about 50% used. Is the collector
> > dropping packets/flows during that time? Is the machine that slow?
> 
> It is difficult to tell whether or not you are dropping packets - it
> depends a lot on what happens on the kernel side too. I'm not sure what
> information on packet drops libpcap makes available, but the attached
> patch adds printing of the statistics that it does provide to those
> printable using "softflowctl statistics".
> 
> I'm not sure what a "dropped packet" in libpcap's parlance is - it may
> be a drop because the client couldn't keep up, or a drop from a bpf
> filter program. It should be obvious once you start playing with it
> though :)

Thank you very much for the patch. I will apply it later today and come
back with the details.

> 
> > - Reports - I am not sure whether this is a softflowd problem or an
> > rrd-related problem. I've noticed a continuous udp flow running over
> > the network, quite considerable in terms of bandwidth. However, when
> > drawing the graphs, there is only an hourly spike, and nothing else.
> > What could be causing this type of reporting?
> 
> That depends on how the data is being presented - if you are showing
> flow records vs time, then a long UDP conversation may be represented
> by only a single flow record. Retrospectively producing throughput vs
> time representations of long-lived flow conversations is a little
> tricky and requires some suppositions, as the flow record summarises
> away any dynamics in the conversation.
> 

I got this one, understood. 


> -d



