Telegraf SNMP data collection for a fleet of devices

I need advice on how to manage SNMP metric collection from a fleet of about 2,000 devices.

Do I need to split these devices into small chunks and run multiple Telegraf instances, probably in Docker containers? What are the advantages of that approach over running a single Telegraf instance with one configuration file that lists the IPs of all remote devices in agents = [""] under the inputs.snmp plugin?

Please help in clarifying

Hi @revanth

I don’t think there is any need to split the devices because of limitations in Telegraf specifically, but you might like to split your config up into separate files to make it more manageable.

If you wanted to split the config files up, this issue has a pretty good example and explanation of what you might like to do: "Telegraf Configuration - Recommended approach for multiple .conf files?" (Issue #6334, influxdata/telegraf on GitHub)

However - if you are finding that it takes too long to walk the MIBs for thousands of devices then I can see that running multiple Telegraf instances in parallel would speed things up for you.
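If you do go the multiple-instance route, the split itself is easy to script. A minimal Python sketch, assuming you already have the full agent list in memory; the function name and the chunk size of 250 are only illustrative, not a Telegraf recommendation:

```python
def split_into_chunks(agents, chunk_size):
    """Split the agent list into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [agents[i:i + chunk_size] for i in range(0, len(agents), chunk_size)]


# Hypothetical fleet of 2,000 addresses, purely for illustration.
fleet = [f"10.0.{i // 250}.{i % 250 + 1}" for i in range(2000)]

# One chunk per Telegraf instance (or per container).
chunks = split_into_chunks(fleet, 250)
```

Each chunk would then become the agents list of one instance's configuration.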

In my experience, SNMP management interfaces on devices are fickle, and you might find that some devices respond more reliably and more quickly than others. Separating your slower-responding devices out into their own config could allow you to poll the more responsive devices more often.

In summary, the trade-off here is between ease of maintaining the config file(s) and unreliable devices holding up the gathering of metrics for your faster-responding devices.
There is nothing in Telegraf which would prevent you from setting things up in either of these ways.

I would start out with a single Telegraf instance and multiple config files, and if that doesn’t perform reliably for you because of unstable devices, split those off into a different Telegraf setup.
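To make the "fast devices vs. slow devices" idea concrete, here is a minimal Python sketch that renders one config file per group, each with its own polling interval. The group names, file names, IP addresses, community string and the single uptime field are all assumptions for illustration; `interval`, `timeout`, `retries`, `version`, `community` and `agents` are the usual inputs.snmp settings, but check them against the plugin README for your Telegraf version.

```python
from pathlib import Path


def render_snmp_input(agents, interval, timeout="5s", retries=3, community="public"):
    """Render one [[inputs.snmp]] block as TOML text."""
    agent_lines = ",\n".join(f'    "{a}"' for a in agents)
    return (
        "[[inputs.snmp]]\n"
        f'  interval = "{interval}"\n'
        f'  timeout = "{timeout}"\n'
        f"  retries = {retries}\n"
        "  version = 2\n"
        f'  community = "{community}"\n'
        "  agents = [\n"
        f"{agent_lines}\n"
        "  ]\n"
        "\n"
        "  [[inputs.snmp.field]]\n"
        '    name = "uptime"\n'
        '    oid = "1.3.6.1.2.1.1.3.0"\n'  # sysUpTime.0
    )


def write_group_configs(groups, directory):
    """Write one .conf file per group; groups maps filename -> (agents, interval)."""
    paths = []
    for filename, (agents, interval) in groups.items():
        path = Path(directory) / filename
        path.write_text(render_snmp_input(agents, interval))
        paths.append(path)
    return paths


# Hypothetical grouping: responsive devices every 60s, flaky ones every 5 minutes.
example_groups = {
    "snmp_fast.conf": (["udp://10.0.0.1:161", "udp://10.0.0.2:161"], "60s"),
    "snmp_slow.conf": (["udp://10.0.9.1:161"], "300s"),
}
```

Calling `write_group_configs(example_groups, "/etc/telegraf/telegraf.d")` would produce two files that a single Telegraf instance can load together. If the slow group still causes trouble, the same files can be moved to a separate instance without changing their contents.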

Cheers, Will

Thank you @willcooke for the detailed explanation and reference. I was able to follow the GitHub reference, but I want to make sure my understanding is correct. If you don't mind, could you please clarify further:

  1. With one Telegraf config file that lists multiple devices/IPs to monitor, for example agents = [ "IP1", "IP2", "IP3" ], does Telegraf fetch metrics from each IP in parallel or sequentially?

  2. If I create multiple Telegraf configuration files in the telegraf.d directory, will Telegraf create a separate thread for each file and run them in parallel, or process each file one after the other?

Thanks again,
Revanth

As I understand it, they are called sequentially. Under the covers it uses the net-snmp tools: http://www.net-snmp.org/

Telegraf will take all of those separate files and combine them into a single config file internally. They are only separated logically to make it easy to manage.

@willcooke @revanth I’m sorry to interfere, but I have a small question on this topic. I’m monitoring my network devices using Telegraf/SNMP, and as we have a lot of IPs, is there a way to import the agents’ IPs from a text file instead of adding each IP one at a time? To be clearer, instead of having:
agents = [ "IP1", "IP2", "IP3" ]
is there a way to have it like this:
agents = [ include /etc/telegraf/lisIP.text ] ?

@zaki I am not aware of such flexibility. Maybe you could open a separate topic with this question; I would be interested to know as well.
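One workaround that does not depend on any include feature is to generate the agents line from the text file with a small script and write it into the config before Telegraf starts. A minimal Python sketch, assuming the file holds one IP per line (blank lines and `#` comments allowed); the function names are made up for illustration:

```python
import ipaddress
from pathlib import Path


def load_agents(path):
    """Return the IP addresses listed one per line in the given file."""
    agents = []
    for raw in Path(path).read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ipaddress.ip_address(line)  # raises ValueError on a malformed address
        agents.append(line)
    return agents


def render_agents_line(agents):
    """Render the list as a TOML agents assignment."""
    quoted = ", ".join(f'"{a}"' for a in agents)
    return f"agents = [ {quoted} ]"
```

For example, `render_agents_line(load_agents("/etc/telegraf/lisIP.text"))` gives a line that can be pasted or templated into the inputs.snmp section. Telegraf would still need a restart or reload to pick up the regenerated file.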

Did you ever figure this out? I have the same question…

Even though Telegraf can pull the data, for nearly 2k devices I’d probably buy or develop a process to manage adding and removing devices from the polling, or else you're editing Telegraf configs by hand.

You could pull the data via an app or script, dump it into a database or separate JSON files, and then just ingest it via Telegraf. You’d offload all that snmpget traffic to another system and use a single Telegraf config/process to ingest the data.

Python supports multithreading, and with an SNMP module you could do asynchronous pulls from the devices, which would speed up your process.
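A rough sketch of what that could look like, using only the standard library for the threading part. The `poll_device` function is a hypothetical placeholder: it stands in for whatever SNMP module you pick and is assumed to return a dict of metrics for one device. The worker count of 50 is also just a guess to tune.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed


def poll_device(ip):
    """Hypothetical placeholder: replace with a real SNMP GET that returns a dict."""
    raise NotImplementedError("plug in your SNMP library here")


def poll_fleet(ips, poll=poll_device, max_workers=50):
    """Poll all devices concurrently; a failing device does not block the others."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(poll, ip): ip for ip in ips}
        for future in as_completed(futures):
            ip = futures[future]
            try:
                results.append({"agent": ip, "ok": True, **future.result()})
            except Exception as exc:  # timeouts, unreachable hosts, etc.
                results.append({"agent": ip, "ok": False, "error": str(exc)})
    return results


def write_results(results, path):
    """Dump the results as a JSON array for a later ingest step."""
    with open(path, "w") as f:
        json.dump(results, f)
```

The `ok`/`error` fields are one place to hang the safeguards and checks: a device that keeps failing shows up in the output instead of silently slowing everything down.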

Just an idea that is a bit more scalable and gives you more visibility. You can write in many safeguards and checks with a Python script.