Telegraf shows huge spikes for Network in Grafana

Issue: huge spikes in output and negative values shown in Grafana at random times from the network plugin.

My current setup is:

CentOS 7
telegraf-1.3.4-1.x86_64
influxdb-1.2.4-1.x86_64
grafana-4.3.1-1.x86_64

This is an HPC cluster where Telegraf runs on multiple compute nodes and sends output to InfluxDB and Grafana, which are installed on a VM.

The issue that I observe is with the Telegraf network plugin (net).

My requirement is to get network throughput for the cluster as a whole, that is, network throughput for a group of compute nodes. The following is the query that I use in Grafana against InfluxDB.

SELECT derivative(sum(bytes_recv), 1s) as "received bytes/sec", derivative(sum(bytes_sent), 1s) as "sent bytes/sec" FROM net WHERE "cluster" =~ /^$cluster$/ AND interface != 'all' AND interface !~ /^bond/ AND interface !~ /^br/ AND interface !~ /^vnet/ AND $timeFilter GROUP BY time($interval) fill(null)

As you can see above, I am collecting the network stats for all interfaces except a few specific ones. I then take the derivative as explained in the documentation of the Telegraf network plugin:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/system/NET_README.md

The above query works fine most of the time, but it does show huge spikes at particular times. I know such huge spikes are not possible, because the aggregate is sometimes for only about 8 nodes.

I have attached the images from Grafana, both the ones that look right and the ones that do not.
When I query the respective values during those spike times, I do see weird values such as -300 TBps.

The same query, when run per host/node and grouped by interface, works perfectly fine, although I have not yet checked which node causes the spikes, or whether that host's network output shows a spike at the same time the cluster graph does. The following is the query that I use per host.

SELECT derivative(first(bytes_recv), 1s) as "received bytes/sec", derivative(first(bytes_sent), 1s) as "sent bytes/sec" FROM net WHERE "host" =~ /^$host$/ AND interface != 'all' AND interface !~ /^bond/ AND interface !~ /^br/ AND interface !~ /^vnet/ AND $timeFilter GROUP BY interface,time($interval) fill(null)

However, when I don't group it per interface, it does give me weird output.

SELECT derivative(sum(bytes_recv),1s) as "received bytes/sec", derivative(sum(bytes_sent),1s) as "sent bytes/sec" FROM net WHERE "host" =~ /^$host$/ AND interface != 'all' AND interface !~ /^bond/ AND interface !~ /^br/ AND interface !~ /^vnet/ AND $timeFilter GROUP BY time($interval) fill(null)

I cannot group per interface at the cluster level, since that would give me a large number of series and make the graph unreadable.

Please let me know if there is a mistake in my query that can be corrected, or if I need to approach the requirement in a different manner.

Thanks,

Please use non_negative_derivative.
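Applied to the cluster query above, that might look something like the following (a sketch reusing the same dashboard variables and filters; it only clips negative results, it does not explain where they come from):

SELECT non_negative_derivative(sum(bytes_recv), 1s) as "received bytes/sec", non_negative_derivative(sum(bytes_sent), 1s) as "sent bytes/sec" FROM net WHERE "cluster" =~ /^$cluster$/ AND interface != 'all' AND interface !~ /^bond/ AND interface !~ /^br/ AND interface !~ /^vnet/ AND $timeFilter GROUP BY time($interval) fill(null)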

I tried using non_negative_derivative, but it still does not work.
I was at InfluxDays NYC today and discussed this with Daniel.

I have also drilled this down to the host level, just to rule out grouping across multiple hosts as the cause.
I see that it also happens on a single host when collecting metrics with Telegraf.

Please find the attached screenshot.

Something seems to have happened around 20:46 that might have created this spike in the Telegraf metrics.
Can this happen if Telegraf hangs for some reason?

I believe this is related to the open issue:

I think that issue only relates to spikes that occur at the first timestamp, but your spike is in the middle. I think we should check which interface is causing the spike; can you group by interface?
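For example, something along these lines (a sketch based on your earlier per-host query and dashboard variables, just grouped by interface as well):

SELECT non_negative_derivative(first(bytes_sent), 1s) FROM net WHERE "host" =~ /^$host$/ AND $timeFilter GROUP BY host, interface, time($interval) fill(null)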

Daniel,

I had isolated that spike by zooming in on that time period in Grafana; that is why the spike appears in the middle.

I isolated it by interface and by host, but it does not make sense why that interface on that host shows such a spike in the metric.
It looks like InfluxDB received a hugely inflated value for that interface at that point in time.

I selected the metric below at the host level.

SELECT non_negative_derivative(first(bytes_sent), 1s) FROM "net" WHERE "host" =~ /^$host$/ AND $timeFilter GROUP BY time($__interval) fill(null)

I get the values as attached in the screenshot, using the table panel.
The metric showing 129 GBps is wrong; that interface cannot do more than 1000 MBps, or 1 GBps.

It is really important for us to have this working correctly.
Please let us know how to resolve the issue.

Thanks,
Lohit

I think we need to look at the raw data. Can you run a query like this one, but using a timestamp around the spike +/- 1 minute:

influx -database 'telegraf' -execute "select bytes_sent from net where time > '2018-04-23T18:44:00Z' and time < '2018-04-23T18:46:00Z' group by host,interface"

Thanks, Daniel.

Here is the output for:
influx -ssl -host xxxx -database 'telegraf' -execute "select bytes_sent from net where host='lt16' and time > '2018-04-22T21:04:10Z' and time < '2018-04-22T21:06:55Z' group by host,interface" -username "xxxx" -password ''

name: net
tags: host=lt16, interface=ens21f0
time                 bytes_sent
----                 ----------
1524431120000000000 0
1524431130000000000 0
1524431140000000000 0
1524431150000000000 0

name: net
tags: host=lt16, interface=ens21f1
time                 bytes_sent
----                 ----------
1524431120000000000 31619137377436
1524431130000000000 31620249849206
1524431140000000000 31621340368940
1524431150000000000 31622383670572

And here is the output for +1m and -1m:

name: net
tags: host=lt16, interface=ens21f0
time                 bytes_sent
----                 ----------
1524431060000000000 0
1524431070000000000 0
1524431080000000000 0
1524431090000000000 0
1524431100000000000 0
1524431110000000000 0
1524431120000000000 0
1524431130000000000 0
1524431140000000000 0
1524431150000000000 0
1524431160000000000 0
1524431170000000000 0
1524431180000000000 0
1524431190000000000 0
1524431200000000000 0
1524431210000000000 0

name: net
tags: host=lt16, interface=ens21f1
time                 bytes_sent
----                 ----------
1524431060000000000 31613765246860
1524431070000000000 31614588506033
1524431080000000000 31615391577426
1524431090000000000 31616486228563
1524431100000000000 31617562782610
1524431110000000000 31618345181741
1524431120000000000 31619137377436
1524431130000000000 31620249849206
1524431140000000000 31621340368940
1524431150000000000 31622383670572
1524431160000000000 31623395473844
1524431170000000000 31624396227188
1524431180000000000 31625172159398
1524431190000000000 31626217562577
1524431200000000000 31626948659874
1524431210000000000 31627957323564

It is the ens21f1 interface that is showing the spike.
Thanks,
Lohit

I just realized I forgot to convert the timestamp to UTC. Please ignore my previous output; here is the updated output.

influx -ssl -host xxxxxx -database 'telegraf' -execute "select non_negative_derivative(first(\"bytes_sent\"), 10s) from net where host='lt16' and time > '2018-04-23T01:05:05Z' and time < '2018-04-23T01:05:55Z' group by host,interface,time(10s)" -username "xxxxx" -password ''
password:
name: net
tags: host=lt16, interface=ens21f0
time                 non_negative_derivative
----                 -----------------------
1524445510000000000 0
1524445520000000000 0
1524445530000000000 0
1524445540000000000 0
1524445550000000000 0

name: net
tags: host=lt16, interface=ens21f1
time                 non_negative_derivative
----                 -----------------------
1524445510000000000 1.42520748e+08
1524445520000000000 1.43145285e+08
1524445540000000000 1.296417722315e+12
1524445550000000000 1.57398396e+08

As you can see, the bytes_sent output for ens21f1 is missing a value for one time period.
I have also attached the corresponding Grafana output for the same query.

Here is the output for a simple "bytes_sent" query.

influx -ssl -host xxxxx -database 'telegraf' -execute "select \"bytes_sent\" from net where host='lt16' and time > '2018-04-23T01:05:05Z' and time < '2018-04-23T01:05:55Z' group by host,interface" -username "xxxxx" -password ''
password:
name: net
tags: host=lt16, interface=ens21f0
time                 bytes_sent
----                 ----------
1524445510000000000 0
1524445520000000000 0
1524445530000000000 0
1524445540000000000 0
1524445550000000000 0

name: net
tags: host=lt16, interface=ens21f1
time                 bytes_sent
----                 ----------
1524445510000000000 32226907369757
1524445520000000000 32227050515042
1524445530000000000 30930921960948
1524445540000000000 32227339683263
1524445550000000000 32227497081659

I see the same in Grafana too.

For some reason the value dipped, which shouldn't ever happen except during reboots. The values are taken from /proc/net/dev, and it seems unlikely that we are parsing it wrong while still getting reasonable values most of the time.

What kind of system is this? Is it bare metal, a virtual machine, a Docker container, etc.? Is there anything special about this interface, for instance is it bridged or bonded?

It is bare metal.
Nothing special about this interface: it is a 10G physical interface and is not bridged or bonded.
It is a compute node that is part of an HPC cluster, so the network usage can be quite variable.

Do you think it is that dip from 32227050515042 to 30930921960948 that is causing the huge spike from millions to trillions when queried with a derivative? The counter recovers to 32227339683263 in the next 10 s window, a jump of 32227339683263 - 30930921960948 = 1296417722315 bytes, which matches the 1.296417722315e+12 value in the derivative output above; the negative step itself is what non_negative_derivative drops, which is why the point at 1524445530000000000 is missing there.
Could it be caused by high load on InfluxDB or Telegraf?

Thanks,
Lohit

Yes, that is a very large number over a 10s period: over 1 TB in 10 seconds. Could you run cat /proc/net/dev and upload the output? Maybe there will be a clue in there. Also, about how often would you say this happens?

I see this very frequently. I gather the network usage for approximately 100 to 150 servers in each cluster, and the trend graph almost always shows huge spikes, which messes up the graph and the max/min values.

The output for /proc/net/dev for lt16 is below.

cat /proc/net/dev
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 80629029434768 4741352496 0 29506 0 0 0 2088375 33616616118465 4108581472 0 0 0 0 0 0
lo: 69952469262 13693112 0 0 0 0 0 0 69952469262 13693112 0 0 0 0 0 0

Also, I do sometimes see errors like these from Telegraf:

Apr 11 11:10:25 lt16 telegraf[36581]: 2018-04-11T15:10:25Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
Apr 11 11:10:35 lt16 telegraf[36581]: 2018-04-11T15:10:35Z E! InfluxDB Output Error: Post https://lila-info01.hpc.private:8086/write?db=telegraf: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Apr 11 11:10:35 lt16 telegraf[36581]: 2018-04-11T15:10:35Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster

I am not sure if these errors could be the cause of the spikes.
For example, the most recent spikes on lt16 were on Apr 22, while the errors were on Apr 11, so this might be a red herring.

Nothing stands out to me in your /proc/net/dev. Since you have so many hosts, do you see this on all of them? How similar is the hardware between them, and what are the operating system and version?

I'm also curious what the driver is for the network card. On my system I can check it like this:

sudo readlink -f /sys/class/net/ens21f1/device/driver/

As of now, I do not have information on how many hosts are seeing this issue, or whether there is a pattern.
The hardware is similar for about 30 of the 100 servers.
The operating system is CentOS 7, kernel 3.10.0-514.21.2.el7.x86_64.

readlink -f /sys/class/net/ens21f1/device/driver/
/sys/bus/pci/drivers/mlx5_core

That's a Mellanox driver, I guess, because this system has a Mellanox network card.
I am not sure if this really comes down to drivers/OS/hardware.
I think I see this independent of what hardware I select.

I am experimenting on a similar setup in a different cluster, and will see if this issue occurs there too.

We could try to catch the value momentarily decreasing by capturing the contents of /proc/net/dev every 10s; if we observed it, we would know it is a kernel issue. I'm not sure if it would be worth the effort though.
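A rough way to do that outside of Telegraf would be a small loop like this (just a sketch; the log path is an example, and the interface name is taken from your output above):

while true; do
  # append a UTC timestamp plus the raw counter line for the suspect interface
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $(grep ens21f1 /proc/net/dev)" >> /tmp/proc_net_dev.log
  sleep 10
done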

It occurs to me that if the dip only occurs for one period, you could try to filter it away with a larger group by period and max(), so the dipped sample inside each window is ignored. While not ideal, maybe this would be an improvement:

select non_negative_derivative(max("bytes_sent"), 20s) from net where ... group by host, interface,time(20s)

I like the idea of using something else to track the dip in values.

Can I use Telegraf to log/output the values it is sending to InfluxDB?
Maybe I can use a file as one of the outputs for the net input?

Or maybe I should use an exec input to dump the /proc/net/dev file somewhere, so that I know what the file contains when Telegraf reads from it.

Do you think the values Telegraf sends to InfluxDB will be the same as what I get when I run an exec on the same file?

Maybe I should do both of the above, so I will know whether the issue is with the kernel, with Telegraf, or with InfluxDB.

Using Telegraf to run the exec would have the benefit of aligning the collection times fairly closely, and using exec would probably be preferred over net so that we get the full unprocessed contents of the file. We could use the "value" format and capture it as a string; it will be a little hard to read, but it shouldn't be too hard. On the output side, use namepass/namedrop to exclude it from the real InfluxDB output, something like:

[[inputs.exec]]
  name_override = "proc_net_dev"
  commands = ["cat /proc/net/dev"]
  data_format = "value"
  data_type = "string"

[[outputs.file]]
  namepass = ["proc_net_dev"]
  files = ["net.output"]
  data_format = "influx"

[[outputs.influxdb]]
  namedrop = ["proc_net_dev"]

Thanks, Daniel.

I will work on that for all the nodes and will let you know if I see something strange.

Regards,
Lohit