Telegraf shows huge spikes for Network in Grafana

Issue: huge spikes in output and negative values shown in Grafana at random times from the network plugin.

My current setup is:

CentOS 7
telegraf-1.3.4-1.x86_64
influxdb-1.2.4-1.x86_64
grafana-4.3.1-1.x86_64

This is an HPC cluster where Telegraf runs on multiple compute nodes and sends output to InfluxDB and Grafana, which are installed on a VM.

The issue that I observe is with the Telegraf network plugin (net).

My requirement is to get network throughput for the cluster as a whole, that is, network throughput for a group of compute nodes. The following is the query that I use in Grafana against InfluxDB.

SELECT derivative(sum(bytes_recv), 1s) as "received bytes/sec", derivative(sum(bytes_sent), 1s) as "sent bytes/sec" FROM net WHERE "cluster" =~ /^$cluster$/ AND interface != 'all' AND interface !~ /^bond/ AND interface !~ /^br/ AND interface !~ /^vnet/ AND $timeFilter GROUP BY time($interval) fill(null)

As you can see above, I am collecting the network stats for all interfaces except a few specific ones. I then take the derivative as explained in the documentation of the Telegraf network plugin:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/system/NET_README.md

The above query works fine most of the time, but it does show huge spikes at particular times. I know such huge spikes are not possible, because the aggregate is sometimes for only about 8 nodes.

I have attached the images from Grafana, both the ones that look right and the ones that do not.
When I query the respective values during those spike times, I do see weird values such as -300 TBps.

The same query, when run per host/node and grouped by interface, works perfectly fine, although I have not yet checked which node causes the spikes, or whether that host's network output shows a spike at the same time the cluster graph does. The following is the query that I use per host.

SELECT derivative(first(bytes_recv), 1s) as "received bytes/sec", derivative(first(bytes_sent), 1s) as "sent bytes/sec" FROM net WHERE "host" =~ /^$host$/ AND interface != 'all' AND interface !~ /^bond/ AND interface !~ /^br/ AND interface !~ /^vnet/ AND $timeFilter GROUP BY interface,time($interval) fill(null)

However, when I don't group it per interface, it does give me weird output.

SELECT derivative(sum(bytes_recv),1s) as "received bytes/sec", derivative(sum(bytes_sent),1s) as "sent bytes/sec" FROM net WHERE "host" =~ /^$host$/ AND interface != 'all' AND interface !~ /^bond/ AND interface !~ /^br/ AND interface !~ /^vnet/ AND $timeFilter GROUP BY time($interval) fill(null)

I cannot group per interface at the cluster level, since that would give me a large number of series and make the graph unreadable.

Please let me know if there is a mistake in my query that can be corrected, or if I need to approach the requirement in a different manner.

Thanks,

Please use non_negative_derivative.
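Applied to the cluster query above, that might look something like the following (a sketch reusing the same dashboard variables and filters; it only clips negative results, it does not explain where they come from):

SELECT non_negative_derivative(sum(bytes_recv), 1s) as "received bytes/sec", non_negative_derivative(sum(bytes_sent), 1s) as "sent bytes/sec" FROM net WHERE "cluster" =~ /^$cluster$/ AND interface != 'all' AND interface !~ /^bond/ AND interface !~ /^br/ AND interface !~ /^vnet/ AND $timeFilter GROUP BY time($interval) fill(null)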

I tried using non_negative_derivative, but it still does not work.
I was at InfluxDays NYC today and discussed this with Daniel.

I have also drilled this down to the host level, just to rule out grouping across multiple hosts as the cause.
I see that it also happens on a single host when collecting metrics with Telegraf.

Please find the attached screenshot.

Something seems to have happened around 20:46 that might have created this spike in the Telegraf metrics.
Can this happen if Telegraf hangs for some reason?

I believe this is related to the open issue:

I think that issue only relates to spikes that occur at the first timestamp, but your spike is in the middle. I think we should check which interface is causing the spike; can you group by interface?
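For example, something along these lines (a sketch based on your earlier per-host query and dashboard variables, just grouped by interface as well):

SELECT non_negative_derivative(first(bytes_sent), 1s) FROM net WHERE "host" =~ /^$host$/ AND $timeFilter GROUP BY host, interface, time($interval) fill(null)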

Daniel,

I had isolated that spike by zooming in on that time period in Grafana; that is why the spike appears in the middle.

I isolated it by interface and by host, but it does not make sense why that interface on that host shows such a spike in the metric.
It looks like InfluxDB received a hugely inflated value for that interface at that point in time.

I selected the metric below at the host level.

SELECT non_negative_derivative(first(bytes_sent), 1s) FROM "net" WHERE "host" =~ /^$host$/ AND $timeFilter GROUP BY time($__interval) fill(null)

I get the values as attached in the screenshot, using the table panel.
The metric showing 129 GBps is wrong; that interface cannot do more than 1000 MBps, or 1 GBps.

It is really important for us to have this working correctly.
Please let us know how to resolve the issue.

Thanks,
Lohit

I think we need to look at the raw data. Can you run a query like this one, but using a timestamp around the spike +/- 1 minute:

influx -database 'telegraf' -execute "select bytes_sent from net where time > '2018-04-23T18:44:00Z' and time < '2018-04-23T18:46:00Z' group by host,interface"

Thanks, Daniel.

Here is the output for:
influx -ssl -host xxxx -database 'telegraf' -execute "select bytes_sent from net where host='lt16' and time > '2018-04-22T21:04:10Z' and time < '2018-04-22T21:06:55Z' group by host,interface" -username "xxxx" -password ''

name: net
tags: host=lt16, interface=ens21f0
time                 bytes_sent
----                 ----------
1524431120000000000 0
1524431130000000000 0
1524431140000000000 0
1524431150000000000 0

name: net
tags: host=lt16, interface=ens21f1
time                 bytes_sent
----                 ----------
1524431120000000000 31619137377436
1524431130000000000 31620249849206
1524431140000000000 31621340368940
1524431150000000000 31622383670572

And here is the output for +1m and -1m:

name: net
tags: host=lt16, interface=ens21f0
time                 bytes_sent
----                 ----------
1524431060000000000 0
1524431070000000000 0
1524431080000000000 0
1524431090000000000 0
1524431100000000000 0
1524431110000000000 0
1524431120000000000 0
1524431130000000000 0
1524431140000000000 0
1524431150000000000 0
1524431160000000000 0
1524431170000000000 0
1524431180000000000 0
1524431190000000000 0
1524431200000000000 0
1524431210000000000 0

name: net
tags: host=lt16, interface=ens21f1
time                 bytes_sent
----                 ----------
1524431060000000000 31613765246860
1524431070000000000 31614588506033
1524431080000000000 31615391577426
1524431090000000000 31616486228563
1524431100000000000 31617562782610
1524431110000000000 31618345181741
1524431120000000000 31619137377436
1524431130000000000 31620249849206
1524431140000000000 31621340368940
1524431150000000000 31622383670572
1524431160000000000 31623395473844
1524431170000000000 31624396227188
1524431180000000000 31625172159398
1524431190000000000 31626217562577
1524431200000000000 31626948659874
1524431210000000000 31627957323564

It is the ens21f1 interface that is showing the spike.
Thanks,
Lohit

I just realized I forgot to convert the timestamp to UTC. Please ignore my previous output; here is the updated output.

influx -ssl -host xxxxxx -database 'telegraf' -execute "select non_negative_derivative(first(\"bytes_sent\"), 10s) from net where host='lt16' and time > '2018-04-23T01:05:05Z' and time < '2018-04-23T01:05:55Z' group by host,interface,time(10s)" -username "xxxxx" -password ''
password:
name: net
tags: host=lt16, interface=ens21f0
time                 non_negative_derivative
----                 -----------------------
1524445510000000000 0
1524445520000000000 0
1524445530000000000 0
1524445540000000000 0
1524445550000000000 0

name: net
tags: host=lt16, interface=ens21f1
time                 non_negative_derivative
----                 -----------------------
1524445510000000000 1.42520748e+08
1524445520000000000 1.43145285e+08
1524445540000000000 1.296417722315e+12
1524445550000000000 1.57398396e+08

As you can see, the bytes_sent output for ens21f1 is missing a value for one time period.
I have also attached the corresponding Grafana output for the same query.

Here is the output for a simple "bytes_sent" query.

influx -ssl -host xxxxx -database 'telegraf' -execute "select \"bytes_sent\" from net where host='lt16' and time > '2018-04-23T01:05:05Z' and time < '2018-04-23T01:05:55Z' group by host,interface" -username "xxxxx" -password ''
password:
name: net
tags: host=lt16, interface=ens21f0
time                 bytes_sent
----                 ----------
1524445510000000000 0
1524445520000000000 0
1524445530000000000 0
1524445540000000000 0
1524445550000000000 0

name: net
tags: host=lt16, interface=ens21f1
time                 bytes_sent
----                 ----------
1524445510000000000 32226907369757
1524445520000000000 32227050515042
1524445530000000000 30930921960948
1524445540000000000 32227339683263
1524445550000000000 32227497081659

I see the same in Grafana too.

For some reason the value dipped, which shouldn't ever happen except during reboots. The values are taken from /proc/net/dev, and it seems unlikely that we are parsing it wrong while still getting reasonable values most of the time.

What kind of system is this? Is it bare metal, a virtual machine, a Docker container, etc.? Is there anything special about this interface, for instance is it bridged or bonded?

It is bare metal.
Nothing special about this interface: it is a 10G physical interface and is not bridged or bonded.
It is a compute node that is part of an HPC cluster, so the network usage can be quite variable.

Do you think it is that dip from 32227050515042 to 30930921960948 that is causing the huge spike from millions to trillions when queried with a derivative? The counter recovers to 32227339683263 in the next 10 s window, a jump of 32227339683263 - 30930921960948 = 1296417722315 bytes, which matches the 1.296417722315e+12 value in the derivative output above; the negative step itself is what non_negative_derivative drops, which is why the point at 1524445530000000000 is missing there.
Could it be caused by high load on InfluxDB or Telegraf?

Thanks,
Lohit

Yes, that is a very large number over a 10s period: over 1 TB in 10 seconds. Could you run cat /proc/net/dev and upload the output? Maybe there will be a clue in there. Also, about how often would you say this happens?

I see this very frequently. I gather the network usage for approximately 100 to 150 servers in each cluster, and the trend graph almost always shows huge spikes, which messes up the graph and the max/min values.

The output for /proc/net/dev for lt16 is below.

cat /proc/net/dev
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 80629029434768 4741352496 0 29506 0 0 0 2088375 33616616118465 4108581472 0 0 0 0 0 0
lo: 69952469262 13693112 0 0 0 0 0 0 69952469262 13693112 0 0 0 0 0 0

Also, I do sometimes see errors like these from Telegraf:

Apr 11 11:10:25 lt16 telegraf[36581]: 2018-04-11T15:10:25Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
Apr 11 11:10:35 lt16 telegraf[36581]: 2018-04-11T15:10:35Z E! InfluxDB Output Error: Post https://lila-info01.hpc.private:8086/write?db=telegraf: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Apr 11 11:10:35 lt16 telegraf[36581]: 2018-04-11T15:10:35Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster

I am not sure if these errors could be the cause of the spikes.
For example, the most recent spikes on lt16 were on Apr 22, while the errors were on Apr 11, so this might be a red herring.

Nothing stands out to me in your /proc/net/dev. Since you have so many hosts, do you see this on all of them? How similar is the hardware between them, and what are the operating system and version?

I'm also curious what the driver is for the network card. On my system I can check it like this:

sudo readlink -f /sys/class/net/ens21f1/device/driver/

As of now, I do not have information on how many hosts are seeing this issue, or whether there is a pattern.
The hardware is similar for about 30 of the 100 servers.
The operating system is CentOS 7, kernel 3.10.0-514.21.2.el7.x86_64.

readlink -f /sys/class/net/ens21f1/device/driver/
/sys/bus/pci/drivers/mlx5_core

That's a Mellanox driver, I guess, because this system has a Mellanox network card.
I am not sure if this really comes down to drivers/OS/hardware.
I think I see this independent of what hardware I select.

I am experimenting on a similar setup in a different cluster, and will see if this issue occurs there too.

We could try to catch the value momentarily decreasing by capturing the contents of /proc/net/dev every 10s; if we observed it, we would know it is a kernel issue. I'm not sure if it would be worth the effort though.
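A rough way to do that outside of Telegraf would be a small loop like this (just a sketch; the log path is an example, and the interface name is taken from your output above):

while true; do
  # append a UTC timestamp plus the raw counter line for the suspect interface
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $(grep ens21f1 /proc/net/dev)" >> /tmp/proc_net_dev.log
  sleep 10
done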

It occurs to me that if the dip only occurs for one period, you could try to filter it away with a larger group by period and max(), so the dipped sample inside each window is ignored. While not ideal, maybe this would be an improvement:

select non_negative_derivative(max("bytes_sent"), 20s) from net where ... group by host, interface,time(20s)

I like the idea of using something else to track the dip in values.

Can I use Telegraf to log/output the values it is sending to InfluxDB?
Maybe I can use a file as one of the outputs for the net input?

Or maybe I should use an exec input to dump the /proc/net/dev file somewhere, so that I know what the file contains when Telegraf reads from it.

Do you think the values Telegraf sends to InfluxDB will be the same as what I get when I run an exec on the same file?

Maybe I should do both of the above, so I will know whether the issue is with the kernel, with Telegraf, or with InfluxDB.

Using Telegraf to run the exec would have the benefit of aligning the collection times fairly closely, and using exec would probably be preferred over net so that we get the full unprocessed contents of the file. We could use the "value" format and capture it as a string; it will be a little hard to read, but it shouldn't be too hard. On the output side, use namepass/namedrop to exclude it from the real InfluxDB output, something like:

[[inputs.exec]]
  name_override = "proc_net_dev"
  commands = ["cat /proc/net/dev"]
  data_format = "value"
  data_type = "string"

[[outputs.file]]
  namepass = ["proc_net_dev"]
  files = ["net.output"]
  data_format = "influx"

[[outputs.influxdb]]
  namedrop = ["proc_net_dev"]

Thanks, Daniel.

I will work on that for all the nodes and will let you know if I see something strange.

Regards,
Lohit