Telegraf SNMP data collection for a fleet of devices

I need advice on how to manage SNMP metric collection from a fleet of about 2K devices.

Do I need to split these devices into small chunks and run multiple Telegraf instances, probably in Docker containers? What are the advantages of that approach over running a single Telegraf instance with one configuration file that lists the IPs of all remote devices in agents = [""] under the inputs.snmp plugin?

Please help me clarify this.

Hi @revanth

I don’t think there is any need to split the devices because of limitations in Telegraf specifically, but you might like to split your config up into separate files to make it more manageable.

If you wanted to split the config files up, this issue has a pretty good example and explanation of what you might like to do:

However, if you find that it takes too long to walk the MIBs for thousands of devices, then I can see that running multiple Telegraf instances in parallel would speed things up for you.

In my experience, SNMP management interfaces on devices are fickle, and you might find that some devices respond more reliably and more quickly than others. Separating your slower-responding devices out into their own config could allow you to poll the more responsive devices more often.
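As a rough sketch of that separation, the config below uses two inputs.snmp blocks with different polling intervals and timeouts. The IP addresses, community string, interval values, and the single uptime field are placeholders chosen for illustration, not values from this thread.

```toml
# Responsive devices: polled every 30s with a short timeout.
# All addresses and the community string are placeholder values.
[[inputs.snmp]]
  agents = ["udp://192.0.2.10:161", "udp://192.0.2.11:161"]
  version = 2
  community = "public"
  interval = "30s"   # per-plugin override of the agent-wide interval
  timeout = "5s"
  retries = 1

  [[inputs.snmp.field]]
    name = "uptime"
    oid = "1.3.6.1.2.1.1.3.0"   # sysUpTime, numeric so no MIB lookup is needed

# Slow or flaky devices: polled less often, with a longer timeout.
[[inputs.snmp]]
  agents = ["udp://192.0.2.200:161"]
  version = 2
  community = "public"
  interval = "5m"
  timeout = "20s"
  retries = 3

  [[inputs.snmp.field]]
    name = "uptime"
    oid = "1.3.6.1.2.1.1.3.0"
```

Each block is a separate plugin instance with its own interval, so a long timeout on the slow group should not delay collection for the fast group.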

In summary, the trade-off here is between ease of maintaining your config file(s) and the risk of unreliable devices holding up the gathering of metrics from your faster-responding devices.
There is nothing in Telegraf that would prevent you from setting things up in either of these ways.

I would start out with a single Telegraf instance and multiple config files, and if that doesn’t perform reliably for you because of unstable devices, split those devices off into a separate Telegraf setup.

Cheers, Will

Thank you @willcooke for the detailed explanation and reference. I was able to follow the Git reference, but I want to make sure my understanding is correct. If you do not mind, can you please clarify further:

  1. With one Telegraf config file that lists multiple devices/IPs to monitor, for example agents = [ "IP1", "IP2", "IP3" ], does Telegraf fetch metrics from each IP in parallel or sequentially?

  2. If I create multiple Telegraf configuration files in the telegraf.d directory, will Telegraf create a separate thread for each file and run them in parallel, or process the files one after the other?

Thanks again,

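As I understand it, the agents listed in one inputs.snmp block are polled concurrently, each in its own goroutine, while the fields and tables configured for a single agent are fetched one after the other. Under the covers the requests go through the gosnmp library; net-snmp tools such as snmptranslate are used, if at all, only for translating MIB names.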

Telegraf will take all of those separate files and combine them into a single config file internally. They are only separated logically to make it easy to manage.
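To illustrate, here is a minimal sketch of two files under telegraf.d. The file names, IP addresses, and community string are hypothetical. Once loaded, Telegraf treats the result the same as if both inputs.snmp blocks had been written in one file.

```toml
# --- /etc/telegraf/telegraf.d/snmp-site-a.conf (hypothetical file name) ---
[[inputs.snmp]]
  agents = ["udp://192.0.2.10:161", "udp://192.0.2.11:161"]
  version = 2
  community = "public"   # placeholder

  [[inputs.snmp.field]]
    name = "uptime"
    oid = "1.3.6.1.2.1.1.3.0"   # sysUpTime

# --- /etc/telegraf/telegraf.d/snmp-site-b.conf (hypothetical file name) ---
[[inputs.snmp]]
  agents = ["udp://198.51.100.20:161"]
  version = 2
  community = "public"   # placeholder

  [[inputs.snmp.field]]
    name = "uptime"
    oid = "1.3.6.1.2.1.1.3.0"
```

The packaged service normally picks up /etc/telegraf/telegraf.d on its own. When starting Telegraf by hand, the directory can be passed with the --config-directory flag alongside --config.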