Telegraf show huge spikes for Network in Grafana

telegraf
influxdb
grafana

#21

I have asked to output as necessary, and here are the details:

I see this difference between the fileoutput and influxdboutput at this timestamp:
influxdoutput: 1524818640000000000 12197298932002
filoutput: 1524818640000000000 13007685030421

I am not sure, on how influxdb received 12197298932002 as the value, when it is supposed to receive 13007685030421.

The query on Grafana, for trend graph:
SELECT non_negative_derivative(sum(bytes_sent), 1s) as “sent bytes/sec” FROM net WHERE “cluster” =~ /^$cluster$/ AND role =~ /^$role$/ AND interface != ‘all’ AND interface !~ /^bond/ AND interface !~ /^br/ AND interface !~ /^vnet/ AND $timeFilter GROUP BY host,interface,time(10s) fill(null)

You can see the spike for lt09, ens21f1 for bytes_sent

Here is the query and the output from influxdb:
influx -ssl -host xxxxxx -database ‘telegraf’ -execute “select “bytes_sent” from net where host=‘lt09’ and time > ‘2018-04-27T08:43:30Z’ and time < ‘2018-04-27T08:44:50Z’ group by host,interface” -username “xxxx” -password ‘’
password:
name: net
tags: host=lt09, interface=ens21f1
time bytes_sent


1524818620000000000 13007676084323
1524818630000000000 13007681610031
1524818640000000000 12197298932002
1524818650000000000 13007685303321
1524818660000000000 13007685765056
1524818670000000000 13007687108607
1524818680000000000 13007688326894

Here is the output from the file:
proc_net_dev,gpu=NVIDIA,role=lila-compute,host=lt09,cluster=lila,domain=hpc.private value=“Inter-| Receive
| Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 28528483234009 2125214641 1 4895 0 1 0 1057939 13007676084323 1949452383 0 0 0 0 0 0
lo: 701701032 5500222 0 0 0 0 0 0 701701032 5500222 0 0 0 0 0 0” 1524818620000000000
proc_net_dev,domain=hpc.private,gpu=NVIDIA,role=lila-compute,host=lt09,cluster=lila value=“Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 28528521274491 2125217407 1 4895 0 1 0 1057950 13007681610031 1949455120 0 0 0 0 0 0
lo: 701704588 5500251 0 0 0 0 0 0 701704588 5500251 0 0 0 0 0 0” 1524818630000000000
proc_net_dev,role=lila-compute,host=lt09,cluster=lila,domain=hpc.private,gpu=NVIDIA value=“Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 28529273742673 2125241917 1 4895 0 1 0 1057958 13007685030421 1949478353 0 0 0 0 0 0
lo: 701708024 5500278 0 0 0 0 0 0 701708024 5500278 0 0 0 0 0 0” 1524818640000000000
proc_net_dev,role=lila-compute,host=lt09,cluster=lila,domain=hpc.private,gpu=NVIDIA value=“Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 28529353584953 2125244127 1 4895 0 1 0 1057968 13007685303321 1949480505 0 0 0 0 0 0
lo: 701712752 5500320 0 0 0 0 0 0 701712752 5500320 0 0 0 0 0 0” 1524818650000000000
proc_net_dev,domain=hpc.private,gpu=NVIDIA,role=lila-compute,host=lt09,cluster=lila value=“Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 28529492562647 2125249077 1 4895 0 1 0 1057977 13007685765056 1949485290 0 0 0 0 0 0
lo: 701716308 5500349 0 0 0 0 0 0 701716308 5500349 0 0 0 0 0 0” 1524818660000000000
proc_net_dev,domain=hpc.private,gpu=NVIDIA,role=lila-compute,host=lt09,cluster=lila value=“Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 28530073684855 2125268019 1 4895 0 1 0 1057991 13007687108607 1949503533 0 0 0 0 0 0
lo: 701719744 5500376 0 0 0 0 0 0 701719744 5500376 0 0 0 0 0 0” 1524818670000000000
proc_net_dev,host=lt09,cluster=lila,domain=hpc.private,gpu=NVIDIA,role=lila-compute value=“Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 28530586529115 2125285137 1 4895 0 1 0 1057999 13007688326894 1949519740 0 0 0 0 0 0
lo: 701724472 5500418 0 0 0 0 0 0 701724472 5500418 0 0 0 0 0 0” 1524818680000000000


#22

Very interesting, would you be willing to run a custom build with some extra debugging?


#23

Sure,

But is it a custom build for telegraf or influxdb?
Do we know if the issue is related to telegraf of influxdb, else is there any other way - we can find that out?


#24

With our previous test there is still the possibility of /proc/dev/net changing for only a very small time, so we haven’t eliminated any of the points of error. We can verify if it is Telegraf or InfluxDB side by adding a file output like this that will save the same values we are sending to InfluxDB:

[[outputs.file]]
  namepass = ["net"]
  files = ["net.output"]
  data_format = "influx"

Can you run this until you see a spike and then compare the values?


#25

Sure.
I was thinking of the same.
I will enable the above, and update soon.

Thank you.


#26

I collected the output from both telegraf and /proc/net/dev and below are the results:

As you can see - the value is supposed to be 26165411222784, but instead it is: 10771612740183i

packets_sent=1262282537i,err_in=0i,err_out=0i,drop_in=11982i,bytes_sent=10771612740183i 1525315410000000000
value=“Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 47019231650603 3340896996 0 11982 0 0 0 2900308 26165411222784 2875037371 0 0 0 0 0 0
lo: 29651959085 12172164 0 0 0 0 0 0 29651959085 12172164 0 0 0 0 0 0” 1525315410000000000

And the results from query are:

tags: host=lt19, interface=ens21f1
time bytes_sent


1525315400000000000 26165368181479
1525315410000000000 10771612740183
1525315420000000000 26165436702588

Please do let me know, the custom build of telegraf that we should run to debug to this issue.

Thanks,
Lohit


#27

Do you have the corresponding output from the file plugin?


#28

Yes, I had pasted both previously one below the other:

File output:
value=“Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 47019231650603 3340896996 0 11982 0 0 0 2900308 26165411222784 2875037371 0 0 0 0 0 0
lo: 29651959085 12172164 0 0 0 0 0 0 29651959085 12172164 0 0 0 0 0 0” 1525315410000000000

telegraf output (truncated):
packets_sent=1262282537i,err_in=0i,err_out=0i,drop_in=11982i,bytes_sent=10771612740183i 1525315410000000000


#29

Here is an amd64 rpm package with these changes. This build will write timestamps and the actual /proc/net/dev contents used to produce the net metrics to /tmp/telegraf-debug.

You can remove the exec input but continue to run with the file output enabled, and then we can line up the actual input with the file output when the issue occurs.


#30

Thanks Daniel.
Here are the results:

telegrad-debug output:
1525451320000599765
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 1672965201657 432797808 0 0 0 0 0 2160008 2492513461302 393448714 0 0 0 0 0 0
lo: 617249140 4892323 0 0 0 0 0 0 617249140 4892323 0 0 0 0 0 0

1525451330000595233
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 1672965291484 432798121 0 0 0 0 0 2160018 2492513569396 393449014 0 0 0 0 0 0
lo: 617250036 4892342 0 0 0 0 0 0 617250036 4892342 0 0 0 0 0 0

1525451340000552903
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eno1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
eno2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ens21f1: 1672965298488 432798220 0 0 0 0 0 2160029 2492513645677 393449059 0 0 0 0 0 0
lo: 617250812 4892359 0 0 0 0 0 0 617250812 4892359 0 0 0 0 0 0

telegraf net file output:
net,host=lt13,interface=ens21f1,bytes_recv=1672965201657i,packets_sent=393448
714i,packets_recv=432797808i,err_in=0i,err_out=0i,drop_in=0i,drop_out=0i,bytes_sent=2492513461302i 1525451320000000000
net,host=lt13,interface=ens21f1,drop_out=0i,bytes_sent=683061990701i,bytes_re
cv=1281246296447i,packets_sent=176574003i,packets_recv=368990809i,err_in=0i,err_out=0i,drop_in=0i 1525451330000000000
net,host=lt13,interface=ens21f1,bytes_recv=1672965298488i,packets_sent=393449059i,packets_recv=432798220i,err_in=0i,err_out=0i,drop_in=0i,drop_out=0i,bytes_sent=2492513645677i 1525451340000000000

Results from influx query:
name: net
tags: host=lt13, interface=ens21f1
time bytes_sent


1525451320000000000 2492513461302
1525451330000000000 683061990701
1525451340000000000 2492513645677
1525451350000000000 2492513691995


#31

Can you upload (if possible) or email me the files for about 30 minutes before and after this occurred?


#33

Hi,
I found this post having similar problem. I get spikes on only one interface from whole switch using telegraf snmp plugin. Counter for some reason spikes and than returns to previous value.

name: switch
time                ifHCInOctets      ifHCOutOctets
----                ------------      -------------
1551918628000000000 12466952951737919 38044572825981800
1551918683000000000 12466962905926618 38044595221859520
1551918743000000000 12466974294910852 38044621344021442
1551918803000000000 12466985914360278 38044649086717090
1551918863000000000 12466997193880800 38044672893515776
1551918927000000000 **13887862747606707** 38044700994131232
1551918984000000000 12467019666181026 38044724666552465
1551919042000000000 12467031376628505 38044747164217913
1551919101000000000 12467043199248346 38044773397350632
1551919164000000000 12467055406975616 38044800981337460

kind regards
Marek


#34

We debugged this offline and traced it back to an error where the line in /proc/net/dev, used by the net plugin, would sometimes emit a value that was way off. Attempting to reread the file would result it the value being correct.

Here is the script we used to duplicate the behavior outside of Telegraf, maybe it will find something on your system too, but don’t expect the timestamps to line up because it seems to happen only for a very short time:

#!/bin/sh

interface="${1:-ens21f1}"
echo "Watching $interface"
prev_proc_net_dev=$(cat /proc/net/dev)
prev_bytes_sent="$(echo "$prev_proc_net_dev" | grep "$interface" | awk '{print $10}')"
if [ "$prev_bytes_sent" = "" ]
then
	echo "Interface not found: $interface"
	exit 1
fi

while true
do
	curr_proc_net_dev=$(cat /proc/net/dev)
	curr_bytes_sent="$(echo "$curr_proc_net_dev" | grep "$interface" | awk '{print $10}')"
	if [ "$curr_bytes_sent" = "" ]
	then
		echo "Interface not found: $interface"
		exit 1
	fi

	if [ "$curr_bytes_sent" -lt "$prev_bytes_sent" ]
	then
		echo
		echo "$(date): bytes_sent has decreased on $interface: $prev_bytes_sent -> $curr_bytes_sent"

		echo "--- prev_proc_net_dev ------"
		echo "$prev_proc_net_dev"
		echo "----------------------------"

		echo "--- curr_proc_net_dev ------"
		echo "$curr_proc_net_dev"
		echo "----------------------------"
	else
		echo -n '.'
	fi

	prev_bytes_sent="$curr_bytes_sent"
	prev_proc_net_dev="$curr_proc_net_dev"
	sleep 10
done