[netflow-tools] softflowd keeps crashing

Mon Apr 13 01:03:19 EST 2009

> On Fri, 27 Mar 2009, alex k wrote:
>
>> hi there,
>>
>> first of all, softflowd is a cool piece of software.
>> we have it on other linux machines (gateways) and it runs perfectly
>> stable there.
>> i use softflowd to collect data and nfsen to capture and evaluate.
>>
>> but there is one host, where softflowd keeps crashing.
>> i am a bit clueless, as instability doesn't seem to be a known problem
>> of softflowd. at least i didn't find anything on the web or this list.
>>
>> some information about the host:
>>
>> kernel (64 bit):
>> 2.6.27.7-9-default #1 SMP 2008-12-04 18:10:04 +0100 x86_64 x86_64 x86_64
>> GNU/Linux
>> libpcap version:
>> libpcap0-0.9.8-47.41
>> softflowd version:
>> softflowd-0.9.8 (compiled without problems on that machine)
>>
>> on this host, which has one network card, vmware-server runs with
>> several guests.
>> the guests use bridged networking; each has its own ip address, but as
>> mentioned, there is only one network card.
>>
>> softflowd crashes occasionally. sometimes once in two weeks, sometimes
>> twice a day.
>> the process disappears, the pid file stays.
>>
>> the only thing i noticed is that around the same time there are often
>> flows with a completely wrong date (about 6 weeks in the future).
>> not at exactly the same time, of course: when softflowd crashes, the
>> possibly critical data is lost.
>>
>> so my questions are:
>> 1) where does softflowd get its time from?
>> 2) can the wrong time be a problem?
>> 3) what else could cause the crashes?
>>
>> especially: how can i find out?
>> softflowd is _very_ quiet. nothing in the syslog, no message at all.
>
> Try running it manually in debug mode:
>
> nohup flowd -dg &
>
> and see what is in nohup.out if/when it crashes.
>
> -d
>

Hi Damien,

Softflowd crashed again.

In the nohup.out file I found the following:

Shutting down after pcap EOF
Shutting down on user request
Starting expiry scan: mode -1
...then a lot of...
Queuing flow seq:102338 ... for expiry reason 6 (or 5, 1, 4)
...then a lot of...
EXPIRED: seq:...

With nfdump I again found entries (from around that time) dated about six
weeks in the future.
But no wrong date in nohup.out.
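
To narrow down a log of this size, a rough sketch like the following could
pull out the shutdown lines and tally the expiry reasons. It assumes the
line shapes quoted above ("Shutting down ..." and "... expiry reason N");
the name summarize is made up for illustration and is not part of softflowd.

```python
import re
from collections import Counter

# Assumed shape, taken from the excerpt above: "... for expiry reason 6"
REASON = re.compile(r"expiry reason (\d+)")

def summarize(lines):
    """Return (shutdown lines with their line numbers, count per expiry reason)."""
    shutdowns = []
    reasons = Counter()
    for number, line in enumerate(lines, start=1):
        if line.startswith("Shutting down"):
            shutdowns.append((number, line.rstrip("\n")))
        match = REASON.search(line)
        if match:
            reasons[int(match.group(1))] += 1
    return shutdowns, reasons
```

It takes any iterable of lines, so an open file object can be passed
directly, e.g. summarize(open("nohup.out", errors="replace")).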

What happened? Network error? Corrupted file? Socket problem?

The host is monitored and was reachable (pingable) at that time.
The nohup.out file is 40 MB and contains a lot of IP addresses.
That's why I hesitate to send it to you, and especially to the list.
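
One way around that might be to mask the addresses before sharing the
file. The sketch below is only an illustration: it blanks the last two
octets of anything that looks like a dotted-quad IPv4 address, does not
handle IPv6, and the names mask_ipv4 and mask_file are hypothetical.

```python
import re

# Anything that looks like a dotted-quad IPv4 address (no range check on octets).
IPV4 = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")

def mask_ipv4(line: str) -> str:
    """Replace the last two octets of every IPv4-looking address with 'x'."""
    return IPV4.sub(r"\1.\2.x.x", line)

def mask_file(src: str, dst: str) -> None:
    """Write a masked copy of src to dst, one line at a time."""
    with open(src, errors="replace") as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(mask_ipv4(line))
```

For example, mask_file("nohup.out", "nohup.masked.out") would leave the
original untouched and write the masked copy next to it. Other dotted
number groups (such as version strings) get masked as well, which is
harmless for this purpose.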

Any idea what I can try next?

xela