SNMP: 64-bit if_counter misreporting since upgrading to Telegraf 1.6.0 and InfluxDB 1.5.2

I’ve been using Telegraf’s ‘new’ SNMP input plugin for over a year now, with great success, until I upgraded (via sudo apt-get update && sudo apt-get upgrade) to v1.6.0 the other day. (I was bumped to InfluxDB 1.5.2 at the same time.)

Since then, my reported interface counter values have stopped incrementing, so my Grafana panels for network input/output (based on non_negative_derivative of the raw 64-bit interface counters) are all showing 0. :frowning:

(screenshot: wan_traffic_zero)

It took me a while to work out how to debug this, as I don’t touch the setup much when it’s working… but I’ve now temporarily enabled a plain-text file output (in addition to the influxdb output), and I can see that Telegraf is sending the same value for several interfaces, and that value never changes. Some samples:

These values are supposed to be from my router’s interfaces (via SNMP):

if_counters,agent_host=192.168.2.1,host=pi5,hostname=ubnt,interface=eth0 bytes_recv=2147483647i,bytes_sent=2147483647i 1524127390000000000
if_counters,agent_host=192.168.2.1,host=pi5,hostname=ubnt,interface=eth1 bytes_recv=2147483647i,bytes_sent=2147483647i 1524127390000000000

And these are supposed to be from my NAS (again via SNMP):

if_counters,agent_host=192.168.2.50,host=pi5,hostname=DiskStation,interface=eth1 bytes_recv=2147483647i,bytes_sent=2147483647i 1524127390000000000
if_counters,agent_host=192.168.2.50,host=pi5,hostname=DiskStation,interface=bond0 bytes_recv=2147483647i,bytes_sent=2147483647i 1524127390000000000

The value of 2147483647i seems to be the one that Telegraf has decided to stick with indefinitely, for some reason - very odd!
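In case it’s useful to anyone debugging the same way, the temporary debug output I mentioned is roughly the following (a minimal sketch assuming Telegraf’s stock file output plugin and the default influx data format; the file path is made up):

```toml
# Temporary debug output, alongside the existing influxdb output.
# Writes each metric in line protocol to a local file for inspection.
[[outputs.file]]
  files = ["/tmp/telegraf-debug.out"]
  data_format = "influx"
```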

If I manually query the same hosts and OIDs via SNMP tools, the correct values are returned, and they change over time, as expected, e.g.:

$ snmpwalk -v 2c -c public 192.168.2.1 .1.3.6.1.2.1.31.1.1.1.6
SNMPv2-SMI::mib-2.31.1.1.1.6.1 = Counter64: 6441722
SNMPv2-SMI::mib-2.31.1.1.1.6.2 = Counter64: 619142896005
SNMPv2-SMI::mib-2.31.1.1.1.6.3 = Counter64: 224393229223
SNMPv2-SMI::mib-2.31.1.1.1.6.4 = Counter64: 0
SNMPv2-SMI::mib-2.31.1.1.1.6.5 = Counter64: 0

Could you please advise whether this is likely to be a bug, or perhaps something wrong with my SNMP input config? The same config has worked fine for the past year, I should add.
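For reference, my SNMP input is along these lines (a from-memory sketch rather than my exact file; the agent addresses match the hosts above, and the OIDs are the standard IF-MIB ifName/ifHCInOctets/ifHCOutOctets columns, but the exact field names and layout here are my reconstruction):

```toml
# Polls the 64-bit (HC) interface counters from both SNMP agents.
[[inputs.snmp]]
  agents = ["192.168.2.1", "192.168.2.50"]
  version = 2
  community = "public"

  [[inputs.snmp.table]]
    name = "if_counters"

    [[inputs.snmp.table.field]]
      name = "interface"
      oid = ".1.3.6.1.2.1.31.1.1.1.1"   # ifName
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "bytes_recv"
      oid = ".1.3.6.1.2.1.31.1.1.1.6"   # ifHCInOctets (Counter64)

    [[inputs.snmp.table.field]]
      name = "bytes_sent"
      oid = ".1.3.6.1.2.1.31.1.1.1.10"  # ifHCOutOctets (Counter64)
```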

Just to add a bit more detail: it appears that some of the interface counters are reported accurately, while others always report the fixed value of 2147483647i. For example, in the following snippet the lo (loopback) interface is reporting correctly, but eth0 and eth1 both show 2147483647i for both bytes_recv and bytes_sent.

if_counters,agent_host=192.168.2.1,host=pi5,hostname=ubnt,interface=eth2 bytes_recv=0i,bytes_sent=0i 1524135570000000000
if_counters,agent_host=192.168.2.1,host=pi5,hostname=ubnt,interface=imq0 bytes_recv=0i,bytes_sent=281088i 1524135570000000000
if_counters,agent_host=192.168.2.1,host=pi5,hostname=ubnt,interface=lo bytes_recv=6441930i,bytes_sent=6441930i 1524135570000000000
if_counters,agent_host=192.168.2.1,host=pi5,hostname=ubnt,interface=eth0 bytes_recv=2147483647i,bytes_sent=2147483647i 1524135570000000000
if_counters,agent_host=192.168.2.1,host=pi5,hostname=ubnt,interface=eth1 bytes_sent=2147483647i,bytes_recv=2147483647i 1524135570000000000

Similarly for another SNMP-monitored host, you can see the tun0 (VPN tunnel) and lo (loopback) interfaces are reported properly, but bond0, eth0 and eth1 are all shown as the magic number of 2147483647i - for both their received and sent counters:

if_counters,agent_host=192.168.2.50,host=pi5,hostname=DiskStation,interface=bond0 bytes_recv=2147483647i,bytes_sent=2147483647i 1524135570000000000
if_counters,agent_host=192.168.2.50,host=pi5,hostname=DiskStation,interface=tun0 bytes_recv=231166848i,bytes_sent=154510910i 1524135570000000000
if_counters,agent_host=192.168.2.50,host=pi5,hostname=DiskStation,interface=lo bytes_sent=1240111133i,bytes_recv=1240111133i 1524135570000000000
if_counters,agent_host=192.168.2.50,host=pi5,hostname=DiskStation,interface=sit0 bytes_recv=0i,bytes_sent=0i 1524135570000000000
if_counters,agent_host=192.168.2.50,host=pi5,hostname=DiskStation,interface=eth0 bytes_recv=2147483647i,bytes_sent=2147483647i 1524135570000000000
if_counters,agent_host=192.168.2.50,host=pi5,hostname=DiskStation,interface=eth1 bytes_recv=2147483647i,bytes_sent=2147483647i 1524135570000000000

Oh, I tried to roll back to 1.5.3 to see if that would magically fix this issue, but it seems that only the latest ‘stable’ version is in the repo, so I couldn’t do it (via sudo apt-get install telegraf:armhf=1.5.3-1).

I think I may have found the issue… :wink:

Apparently 2147483647 is the largest signed 32-bit integer…

https://en.wikipedia.org/wiki/2,147,483,647#In_computing

And the SNMP values I’m fetching are the 64-bit interface counters (they really need to be 64-bit for gigabit networks, as 32-bit counters wrap around to 0 too quickly)… but that doesn’t explain why it was working fine before I moved to the latest versions of Telegraf and InfluxDB.
I’m guessing this might be a bug/regression of some sort, as it’s too much of a coincidence that multiple interfaces on multiple hosts all exceeded the 32-bit limit at around the same time (and the last time I saw a value from my non_negative_derivative function was around the same time as my ill-fated apt-get upgrade).
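To illustrate the pattern (a minimal Python sketch of my guess at what’s happening, not Telegraf’s actual code): if something in the pipeline saturates counter values at the signed 32-bit maximum, then busy interfaces clamp to 2147483647 while small counters like lo still fit and look correct — which matches the output above exactly:

```python
INT32_MAX = 2**31 - 1  # 2147483647, the value stuck in my output

def saturate_to_int32(counter):
    """Clamp a 64-bit SNMP counter at the signed 32-bit maximum."""
    return min(counter, INT32_MAX)

# Busy gigabit interfaces (eth0/eth1) are far past the 32-bit limit:
print(saturate_to_int32(619142896005))   # -> 2147483647
# Small counters (lo, imq0) still fit, so they appear to report fine:
print(saturate_to_int32(6441930))        # -> 6441930
```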

This seems like a bug; would it be possible for you to open a new issue on the Telegraf GitHub?

The 1.5.3 test would be very helpful as well; you can get the 1.5.3 package on the releases page. Sorry about it not being in the apt repo: this is a limitation of the repo creation tool we are using.

Thanks Daniel, I’ve just raised it as issue 4052.

I can confirm that downgrading to Telegraf v1.5.3 (with the same config) resolves the issue. The real values are now being reported again, as expected:

if_counters,interface=eth0,hostname=ubnt,agent_host=192.168.2.1,host=pi5 bytes_sent=222581843967i,bytes_recv=624881268684i 1524211170000000000
if_counters,host=pi5,interface=eth1,hostname=ubnt,agent_host=192.168.2.1 bytes_recv=225626484913i,bytes_sent=625239852988i 1524211170000000000

No big deal to me, but when the counters started working again this morning, the sudden leap in reported values (fed through non_negative_derivative functions to give bits-per-second values) caused my Grafana panels to show some “quite big” WAN bandwidth use…
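Rough numbers on that spike (assuming, hypothetically, a 30-second collection interval; the counter values are taken from the samples above): the one-off jump from the clamped value back to the real counter produces a single enormous derivative point:

```python
# Hypothetical back-of-envelope for the spike: the last clamped sample
# versus the first real sample after the downgrade, assuming a
# 30-second collection interval (my guess, not a confirmed setting).
clamped_sample = 2147483647        # stuck value under Telegraf 1.6.0
real_sample = 624881268684         # eth0 bytes_recv after the downgrade
interval_s = 30

delta_bytes = real_sample - clamped_sample
fake_bps = delta_bytes * 8 / interval_s  # bits/sec for that one point
print(f"{fake_bps / 1e9:.1f} Gbit/s")    # prints 166.1 Gbit/s
```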

(screenshot: wan_traffic_huge_spike)

I wish my connection was that fast in reality! :rofl:
