Getting meaningful error data from Telegraf

Hey everyone!

I’m new here and to Telegraf too. I’m struggling to find out how to get meaningful error data from Telegraf to display in our SNMP dashboard.

For example, I’d like to get data about devices that are unreachable over SNMP (SNMP input) because they are down. I’d like to know exactly which agent has issues and what those issues are.

I tried to get this info from the inputs.internal plugin, but all I can find is an error counter in “internal_gather” that keeps increasing. It doesn’t say anything else; it’s just a number that increments. I’d like to see at least which agent is failing, and ideally the error message too.

Can you think of any way to solve this, please? :pray:

Here is my config (simple for now, with 1 agent):

[agent]
  interval = "1m"

[[inputs.snmp]]
  path = ["/usr/share/snmp/mibs"]
  agents = ["snmp_simulator:161"]
  timeout = "5s"
  version = 2
  community = "public"

  [[inputs.snmp.field]]
    oid = "1.3.6.1.4.1.4096.10000.1.1"
    name = "utilLoad"

[[inputs.internal]]
  collect_memstats = true
  collect_gostats = false

[[outputs.postgresql]]
  connection="host=db port=5432 user=admin password=admin sslmode=disable dbname=db"

I got a reply from @Hipska in the community Slack.

He suggests setting the alias property under [[inputs.snmp]] to the device’s hostname. If you use the Internal Input Plugin, this alias then appears in the internal_gather table, so you can check which hostname errored.
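For anyone finding this later, here is a minimal sketch of how I understand the alias suggestion, based on my config above. The alias value is just my simulator's hostname. Since an alias names a whole plugin instance, I'm assuming you need one [[inputs.snmp]] block per device to get a per-device alias.

```toml
# Sketch only: one [[inputs.snmp]] block per device, so each alias maps to one agent.
[[inputs.snmp]]
  # The alias names this plugin instance; it should show up as the "alias" tag
  # on internal_gather, next to the error counter.
  alias = "snmp_simulator"
  agents = ["snmp_simulator:161"]
  timeout = "5s"
  version = 2
  community = "public"

  [[inputs.snmp.field]]
    oid = "1.3.6.1.4.1.4096.10000.1.1"
    name = "utilLoad"

# Needed so that internal_gather is collected at all.
[[inputs.internal]]
```

With this in place, a growing error count in internal_gather can be filtered by alias to see which device is failing. It still doesn't give the error message itself.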

Also, to detect failures, you can look at SNMP objects such as Uptime: if no data is available for a hostname after the last SNMP poll, most likely something is wrong with the device.
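A rough sketch of the uptime idea, assuming the device exposes the standard sysUpTime object (numeric OID 1.3.6.1.2.1.1.3.0). The field name "uptime" is my own choice, not something Hipska specified.

```toml
# Sketch only: poll the standard uptime object alongside the other fields.
[[inputs.snmp]]
  agents = ["snmp_simulator:161"]
  version = 2
  community = "public"

  [[inputs.snmp.field]]
    # sysUpTime.0 from the standard MIB-2 system group
    oid = "1.3.6.1.2.1.1.3.0"
    name = "uptime"
```

If the latest poll interval has no uptime row for a given device, that device is a candidate for being down or unreachable.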

He uses a combination of both, plus the Ping Input Plugin, to get even more information on the devices’ health.
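I don't have his exact setup, so this is only a sketch of what the ping part could look like for my single device. The hostname comes from my config above and the count value is an assumption.

```toml
# Sketch only: ping the same host that the SNMP input polls.
[[inputs.ping]]
  # Hosts to ping; each one is reported with its own "url" tag.
  urls = ["snmp_simulator"]
  # Number of ping packets sent per collection interval.
  count = 1
```

The ping measurement reports packet loss per host, which helps to tell "device is down" apart from "device is up but SNMP is not answering".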

An image he provided:

Thank you Hipska once again!
