Best way to scale Telegraf

I am using Telegraf with the snmp plugin to monitor network equipment. For now I only poll a few devices, and it works fine. I am looking to scale this to hundreds of devices and would like to know the best way to do so. I am currently using the --config-directory option to manage a separate config file for the snmp input settings. Should I simply keep adding more .conf files, or at some point should I consider running multiple Telegraf instances?

I would think about SPOFs (single points of failure). Having one Telegraf instance polling a lot of equipment might create one. If the equipment has native SNMP capabilities, maybe pushing data out of the equipment into a pool of Telegraf instances could be a better solution? Thinking about failure modes is difficult because you’re dealing with a lot of unknowns, but hopefully this helps.

Defining multiple snmp plugins, instead of one plugin with multiple agents, has an advantage because it prevents a slow agent from blocking collection from the other agents:

  # snip
  # snip
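
For illustration, the one-plugin-per-device layout described above might look roughly like the sketch below. The addresses, community string, timeout, and retry values are placeholders I made up, not settings from this thread, and the two OIDs are the standard sysUpTime.0 and sysName.0 objects.

```toml
# Sketch only: addresses, community, timeout and retries are placeholder values.

# First device gets its own snmp plugin instance.
[[inputs.snmp]]
  agents = ["udp://192.0.2.10:161"]   # hypothetical device address
  version = 2
  community = "public"                # placeholder community string
  timeout = "5s"
  retries = 1

  [[inputs.snmp.field]]
    name = "hostname"
    oid = ".1.3.6.1.2.1.1.5.0"        # sysName.0
    is_tag = true

  [[inputs.snmp.field]]
    name = "uptime"
    oid = ".1.3.6.1.2.1.1.3.0"        # sysUpTime.0

# Second device gets a separate instance, so a timeout here
# does not hold up collection from the first device.
[[inputs.snmp]]
  agents = ["udp://192.0.2.11:161"]   # hypothetical device address
  version = 2
  community = "public"
  timeout = "5s"
  retries = 1

  [[inputs.snmp.field]]
    name = "hostname"
    oid = ".1.3.6.1.2.1.1.5.0"
    is_tag = true

  [[inputs.snmp.field]]
    name = "uptime"
    oid = ".1.3.6.1.2.1.1.3.0"
```

Each block could also live in its own file under the config directory, which keeps per-device changes isolated.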

Multiple processes are mostly useful only once you reach the limit of how fast you can send to the outputs. You may also need multiple Telegraf instances on separate systems if you saturate the network connection or the CPU.
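
One possible way to see whether a single instance is approaching that output limit is to tune the agent's batching and buffering and enable the internal plugin, which reports Telegraf's own statistics (including counts of written and dropped metrics). The values below are illustrative assumptions, not recommendations for any particular setup.

```toml
# Sketch only: all values are illustrative and should be tuned per setup.
[agent]
  interval = "60s"             # how often inputs are collected
  flush_interval = "10s"       # how often outputs are flushed
  metric_batch_size = 1000     # metrics sent to an output per write
  metric_buffer_limit = 10000  # metrics buffered per output before drops occur

# Reports Telegraf's own gather and write statistics as metrics.
[[inputs.internal]]
  collect_memstats = false
```

If the dropped-metric counts stay at zero and writes keep up with collection, the single instance probably still has headroom.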

When it comes to high availability, I don’t think the type of architecture that @exabrial suggests will be possible with SNMP, because of its polling design. You could collect from each device twice, but that may introduce a lot of extra load, so I would probably avoid it.

Thanks! HA / SPOF is not a concern in my case, so I am not worried about it.
I started using the --config-directory telegraf.d option, where I create a unique SNMP input per host, so it more or less aligns with your recommendation.
Thanks for your recommendations. I guess the next step is to figure out how many inputs I can configure, and how many measurements I can send, on a single Telegraf instance :slight_smile: