Issue: Huge spikes and negative values from the Telegraf network plugin appear in Grafana at random times.
My current setup is:
CentOS 7
telegraf-1.3.4-1.x86_64
influxdb-1.2.4-1.x86_64
grafana-4.3.1-1.x86_64
This is an HPC cluster where Telegraf runs on multiple compute nodes and sends output to InfluxDB and Grafana, both installed on a VM.
The issue that I observe is with the Telegraf network plugin (net).
My requirement is to get network throughput for the cluster, that is, the aggregate network throughput for a group of compute nodes. The following is the query that I use in Grafana against InfluxDB.
SELECT derivative(sum(bytes_recv), 1s) as "received bytes/sec", derivative(sum(bytes_sent), 1s) as "sent bytes/sec" FROM net WHERE "cluster" =~ /^$cluster$/ AND interface != 'all' AND interface !~ /^bond/ AND interface !~ /^br/ AND interface !~ /^vnet/ AND $timeFilter GROUP BY time($interval) fill(null)
As you can see above, I am collecting the network stats for all interfaces except a few specific ones (the 'all' aggregate and the bond, br and vnet interfaces). I then take the derivative, as explained in the documentation of the network plugin for Telegraf.
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/system/NET_README.md
The above query works perfectly fine most of the time, but it shows huge spikes at particular times. I know that spikes of this size are not physically possible, because the aggregate is sometimes for only about 8 nodes.
I have attached images from Grafana showing graphs that work and graphs that do not.
When I query the respective values at the times of the spikes, I do see weird values such as -300 TBps.
The same query run per host/node works perfectly fine when grouped by interface. I have not yet verified which node is causing the spikes, or whether that host's network output shows a spike at the same time as the cluster does. The following is the query that I use per host.
SELECT derivative(first(bytes_recv), 1s) as "received bytes/sec", derivative(first(bytes_sent), 1s) as "sent bytes/sec" FROM net WHERE "host" =~ /^$host$/ AND interface != 'all' AND interface !~ /^bond/ AND interface !~ /^br/ AND interface !~ /^vnet/ AND $timeFilter GROUP BY interface,time($interval) fill(null)
However, when I don't group it per interface, it does give me weird output.
SELECT derivative(sum(bytes_recv),1s) as "received bytes/sec", derivative(sum(bytes_sent),1s) as "sent bytes/sec" FROM net WHERE "host" =~ /^$host$/ AND interface != 'all' AND interface !~ /^bond/ AND interface !~ /^br/ AND interface !~ /^vnet/ AND $timeFilter GROUP BY time($interval) fill(null)
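To illustrate one way this could happen, here is a minimal Python sketch. It is only a guess at the cause, not something I have confirmed on my cluster. It assumes that a series sometimes has no sample inside a time bucket, so the summed counter briefly drops and then jumps back. The series names, counter values and the 10-second interval are all made up for the example.

```python
# Hypothetical cumulative byte counters, one sample per 10 s bucket.
# None marks a bucket in which no sample arrived for that series.
INTERVAL_S = 10

series = {
    "node1/eth0": [1000, 2000, 3000, 4000],
    "node2/eth0": [5000, 6000, None, 8000],  # one missing sample (assumed)
}


def derivative(values, interval_s):
    """Per-second rate between consecutive points; None if either point is missing."""
    rates = []
    for prev, cur in zip(values, values[1:]):
        if prev is None or cur is None:
            rates.append(None)
        else:
            rates.append((cur - prev) / interval_s)
    return rates


def sum_present(columns):
    """Sum the values in each bucket, ignoring missing ones."""
    return [sum(v for v in col if v is not None) for col in zip(*columns)]


# Approach A: sum the raw counters per bucket, then take the derivative.
sum_then_rate = derivative(sum_present(series.values()), INTERVAL_S)

# Approach B: take the derivative of each series, then sum the rates.
per_series_rates = [derivative(v, INTERVAL_S) for v in series.values()]
rate_then_sum = sum_present(per_series_rates)

print("sum then derivative:", sum_then_rate)
print("derivative then sum:", rate_then_sum)
```

In this toy data the true combined rate is 200 bytes/sec throughout. Approach A produces a negative value followed by a large positive one around the missing sample, which looks like the pattern in my graphs. Approach B only under-reports during the gap.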
I cannot group per interface at the cluster level, since that would give me multiple graph series and make the panel unreadable.
Please let me know if there is a mistake in my query that I can correct, or if I need to approach the requirement in a different manner.
Thanks,