Telegraf data collection not working with Grafana after update

I have been battling an issue for quite some time that I have been unable to resolve. Whenever I upgrade my Telegraf install from version 1.5.3 to anything newer, I no longer see any data in Grafana.

I was running Ubuntu 18.04 and decided it was time to upgrade to 22.04. Along with this dist upgrade, I upgraded InfluxDB, Grafana, and Telegraf to their latest versions, even though I knew from past attempts that the Telegraf upgrade would make my data stop showing in Grafana.

I am hoping this community can point me in the right direction as to why upgrading Telegraf from version 1.5.3 to version 1.24.0 causes my data to stop showing up in Grafana.

I am struggling with how to debug this issue and confirm that Telegraf is collecting the data and sending it to InfluxDB properly.

Here are some highlights. In Grafana, I can clearly see that the graphs stop showing data from the time the upgrade took place. In the past, I have upgraded to other versions of Telegraf, hit the same issue, and always ended up reverting to version 1.5.3. This time I am determined to see the update through.

I’ve modified the telegraf.conf file to send the metrics to stdout as well as to a file, to confirm that Telegraf is still collecting the information it is configured to collect. Below is a small snippet of the output seen in the output file.
"snmp,agent_host=192.168.0.5,host=ubuntu,hostname=SW-2960-8.gotti.net,ifAlias=Gotti-Desktop,ifDescr=GigabitEthernet0/8,ifName=Gi0/8 ifLinkUpDownTrapEnable=1i,ifCounterDiscontinuityTime=0i,ifInBroadcastPkts=8449719i,ifOutMulticastPkts=17461705i,ifHCInOctets=6829494354993i,ifHCInMulticastPkts=3856540i,ifPromiscuousMode=2i,ifConnectorPresent=1i,ifInMulticastPkts=3856540i,ifHCInUcastPkts=10544020938i,ifHCInBroadcastPkts=8449719i,ifHCOutOctets=6777258262154i,ifHCOutMulticastPkts=17461705i,ifOutBroadcastPkts=7451597i,ifHCOutUcastPkts=9116329503i,ifHCOutBroadcastPkts=7451597i,ifHighSpeed=1000i 1663031490000000000"
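For reference, the debug output above was enabled with roughly the following addition to telegraf.conf (the file path here is illustrative, not my exact path):

```toml
# Sketch of the debug output: mirrors collected metrics to stdout and a file
# in InfluxDB line protocol, so collection can be verified independently of
# the database. The path "/tmp/telegraf-metrics.out" is an example only.
[[outputs.file]]
  files = ["stdout", "/tmp/telegraf-metrics.out"]
  data_format = "influx"
```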

I’ve also tried to debug InfluxDB to see if it is receiving data, and as far as I can tell it is. I suspect the metrics it is receiving from Telegraf are going into a different database or measurement, but I don’t know enough to confirm. Below is some output I have collected from InfluxDB, in both a working and a non-working state:

Working State:
Sep 12 21:32:32 ubuntu influxd-systemd-start.sh[66113]: ts=2022-09-13T01:32:32.444588Z lvl=info msg="Executing query" log_id=0cu72hml000 service=query query="SELECT non_negative_derivative(mean(ifHCOutOctets), 1s) * 8 FROM telegraf.autogen.snmp WHERE (agent_host = '192.168.0.1' AND ifName = 'em1') AND time >= 461764h AND time <= 1662436799999ms GROUP BY time(15m)"
Sep 12 21:32:32 ubuntu influxd-systemd-start.sh[66113]: ts=2022-09-13T01:32:32.457232Z lvl=info msg="Executing query" log_id=0cu72hml000 service=query query="SELECT last(ifHCInOctets) FROM telegraf.autogen.snmp WHERE (agent_host = '192.168.0.1' AND ifName = 'em1') AND time >= 461764h AND time <= 1662436799999ms GROUP BY time(15m)"

Non Working State:
Sep 12 21:30:38 ubuntu influxd-systemd-start.sh[66113]: ts=2022-09-13T01:30:38.455258Z lvl=info msg="Executing query" log_id=0cu72hml000 service=query query="SELECT last(ifHCInOctets) FROM telegraf.autogen.snmp WHERE (agent_host = '192.168.0.1' AND ifName = 'em1') AND time >= now() - 30m AND time <= now() GROUP BY time(20s)"
Sep 12 21:30:38 ubuntu influxd-systemd-start.sh[66113]: ts=2022-09-13T01:30:38.460442Z lvl=info msg="Executing query" log_id=0cu72hml000 service=query query="SELECT non_negative_derivative(mean(ifHCInOctets), 1s) * 8 FROM telegraf.autogen.snmp WHERE (agent_host = '192.168.0.1' AND ifName = 'em1') AND time >= now() - 30m AND time <= now() GROUP BY time(20s)"
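To rule out the data landing in a different database or measurement, I believe checks along these lines in the InfluxDB 1.x `influx` shell would apply (a sketch only; the database and measurement names are taken from the queries above, and I have not verified this against my own session):

```sql
-- List databases, then measurements in the "telegraf" database:
SHOW DATABASES
SHOW MEASUREMENTS ON "telegraf"
-- Confirm fresh points are still arriving in the expected measurement:
SELECT * FROM "telegraf"."autogen"."snmp" ORDER BY time DESC LIMIT 5
```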

Finally, below is the conf file that I am using to poll the network device that I am collecting SNMP data from.
[[inputs.snmp]]
  agents = ["192.168.0.1:161"]
  version = 2
  community = "redacted"
  name = "snmp"

[[inputs.snmp.field]]
  name = "hostname"
  oid = "RFC1213-MIB::sysName.0"
  is_tag = true

[[inputs.snmp.table]]
  name = "snmp"
  inherit_tags = ["hostname"]
  oid = "IF-MIB::ifXTable"

[[inputs.snmp.table.field]]
  name = "ifName"
  oid = "IF-MIB::ifName"
  is_tag = true

Current versions installed:
Telegraf 1.24.0 (git: HEAD@3c4a6516)
InfluxDB shell version: 1.8.10
Grafana v9.1.4 (2186d0bbeb)

I am sure I am leaving out some important information; please let me know what else I need to provide. I appreciate anyone’s help in resolving this issue.

Running the below command to follow the influxdb log shows the following:

gotti@ubuntu:/etc/telegraf/telegraf.d$ sudo journalctl -u influxdb.service -n 20 -f
Sep 13 07:11:16 ubuntu influxd-systemd-start.sh[66113]: [httpd] 192.168.0.49 - - [13/Sep/2022:07:11:16 -0400] "POST /write?db=telegraf HTTP/1.1 " 204 0 "-" "telegraf" c9511990-3354-11ed-9afc-00505683f2ff 26290
Sep 13 07:11:16 ubuntu influxd-systemd-start.sh[66113]: [httpd] 192.168.0.49 - - [13/Sep/2022:07:11:16 -0400] "POST /write?db=telegraf HTTP/1.1 " 204 0 "-" "telegraf" c9525140-3354-11ed-9afd-00505683f2ff 20416

Below is output from the log for telegraf:

gotti@ubuntu:/etc/telegraf/telegraf.d$ sudo journalctl -u telegraf.service -n 20 -f
Sep 12 21:11:46 ubuntu systemd[1]: telegraf.service: Consumed 3.839s CPU time.
Sep 12 21:11:46 ubuntu systemd[1]: Starting The plugin-driven server agent for reporting metrics into InfluxDB...
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z W! DeprecationWarning: Option "address" of plugin "inputs.http_response" deprecated since version 1.12.0 and will be removed in 2.0.0: use 'urls' instead
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z W! DeprecationWarning: Option "address" of plugin "inputs.http_response" deprecated since version 1.12.0 and will be removed in 2.0.0: use 'urls' instead
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z W! DeprecationWarning: Option "address" of plugin "inputs.http_response" deprecated since version 1.12.0 and will be removed in 2.0.0: use 'urls' instead
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z W! DeprecationWarning: Option "address" of plugin "inputs.http_response" deprecated since version 1.12.0 and will be removed in 2.0.0: use 'urls' instead
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z I! Starting Telegraf 1.24.0
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z I! Available plugins: 222 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z I! Loaded inputs: cpu disk diskio http_response (4x) kernel mem ping processes snmp (4x) swap system
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z I! Loaded aggregators:
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z I! Loaded processors:
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z I! Loaded outputs: prometheus_client
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z I! Tags enabled: host=ubuntu
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z W! Deprecated inputs: 0 and 4 options
Sep 12 21:11:47 ubuntu systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 12 21:11:47 ubuntu telegraf[64502]: 2022-09-13T01:11:47Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"ubuntu", Flush Interval:10s
Sep 12 21:11:48 ubuntu telegraf[64502]: 2022-09-13T01:11:48Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
Sep 12 21:45:42 ubuntu telegraf[64502]: 2022-09-13T01:45:42Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 10s
Sep 13 00:46:30 ubuntu telegraf[64502]: 2022-09-13T04:46:30Z W! [inputs.snmp] Collection took longer than expected; not complete after interval of 10s
Sep 13 05:48:30 ubuntu telegraf[64502]: 2022-09-13T09:48:30Z W! [inputs.snmp] Collection took longer than expected; not complete after interval of 10s

I’ve enabled debugs via the telegraf.conf file with the below changes:

[agent]
debug = true

While following the log I am not seeing anything related to the SNMP polling:

gotti@ubuntu:/etc/telegraf/telegraf.d$ sudo journalctl -u telegraf.service -n 20 -f
Sep 13 07:21:54 ubuntu telegraf[138001]: 2022-09-13T11:21:54Z D! [outputs.prometheus_client] Wrote batch of 76 metrics in 1.037615ms
Sep 13 07:21:54 ubuntu telegraf[138001]: 2022-09-13T11:21:54Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
Sep 13 07:22:00 ubuntu telegraf[138001]: 2022-09-13T11:22:00Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/user/1000/doc"): permission denied
Sep 13 07:22:00 ubuntu telegraf[138001]: 2022-09-13T11:22:00Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/user/1000/gvfs"): permission denied
Sep 13 07:22:00 ubuntu telegraf[138001]: 2022-09-13T11:22:00Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/user/121/doc"): permission denied
Sep 13 07:22:00 ubuntu telegraf[138001]: 2022-09-13T11:22:00Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/user/121/gvfs"): permission denied
Sep 13 07:22:00 ubuntu telegraf[138001]: 2022-09-13T11:22:00Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/35330a06e663"): permission denied
Sep 13 07:22:00 ubuntu telegraf[138001]: 2022-09-13T11:22:00Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/e67d4e6a8dff"): permission denied
Sep 13 07:22:00 ubuntu telegraf[138001]: 2022-09-13T11:22:00Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/ca93faf99f05"): permission denied
Sep 13 07:22:04 ubuntu telegraf[138001]: 2022-09-13T11:22:04Z D! [outputs.prometheus_client] Wrote batch of 76 metrics in 1.806794ms
Sep 13 07:22:04 ubuntu telegraf[138001]: 2022-09-13T11:22:04Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
Sep 13 07:22:10 ubuntu telegraf[138001]: 2022-09-13T11:22:10Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/user/1000/doc"): permission denied
Sep 13 07:22:10 ubuntu telegraf[138001]: 2022-09-13T11:22:10Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/user/1000/gvfs"): permission denied
Sep 13 07:22:10 ubuntu telegraf[138001]: 2022-09-13T11:22:10Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/user/121/doc"): permission denied
Sep 13 07:22:10 ubuntu telegraf[138001]: 2022-09-13T11:22:10Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/user/121/gvfs"): permission denied
Sep 13 07:22:10 ubuntu telegraf[138001]: 2022-09-13T11:22:10Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/35330a06e663"): permission denied
Sep 13 07:22:10 ubuntu telegraf[138001]: 2022-09-13T11:22:10Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/e67d4e6a8dff"): permission denied
Sep 13 07:22:10 ubuntu telegraf[138001]: 2022-09-13T11:22:10Z D! [inputs.disk] [SystemPS] => unable to get disk usage ("/run/docker/netns/ca93faf99f05"): permission denied

This is a huge upgrade with thousands of changes between versions.

OK, so Telegraf is successfully collecting data.

You mention sending data to InfluxDB, but in your logs the only loaded output is prometheus_client. No InfluxDB output is loaded, so nothing is going to send data to InfluxDB. The debug log messages confirm that the metrics being collected are written to the prometheus_client output.

While it does look like a Telegraf instance is writing to InfluxDB, it isn’t the one whose config and logs you showed. Are you sure you don’t have another instance running somewhere else?

You should double-check your config to ensure it contains an influxdb output and that you are passing that config file to Telegraf. By default it will only look at /etc/telegraf/telegraf.conf.
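A quick way to check both of those things from the command line (paths shown are the Debian/Ubuntu package defaults, given here as an illustration, not taken from your logs):

```shell
# Show the service's ExecStart line to see which --config and
# --config-directory flags Telegraf is actually started with:
systemctl cat telegraf | grep -i exec
# Check whether any loaded file actually enables an influxdb output:
grep -rn "outputs.influxdb" /etc/telegraf/telegraf.conf /etc/telegraf/telegraf.d/
```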

Thank you very much for your response. I took a look at my config file located in /etc/telegraf/telegraf.conf and found that the entire influxdb output section was commented out and only the prometheus portion of the config file was uncommented.

I am not using Prometheus, so I commented that section out, reconfigured the influxdb portion, and restarted Telegraf.
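For anyone hitting the same thing, the re-enabled output section looks roughly like this (the URL and database name are my local defaults, shown as an example rather than a verbatim copy of my config):

```toml
# Sketch of the restored InfluxDB v1 output; the url and database shown
# are assumptions matching the defaults seen elsewhere in this thread.
[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
  database = "telegraf"
```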

I am happy to report that everything is working now. Thanks again for your help!
