Best way to scale Telegraf

Touchedegris · October 2, 2018, 6:05pm

Hi,
I am using Telegraf with the snmp plugin to monitor network equipment. For now, I only poll a few devices and it works fine. I am looking to scale it to hundreds of devices and would like to know the best way to do that. I am currently using the --config-directory option to manage a separate config file for the snmp input settings. Should I simply keep adding more .conf files, or should I at some point consider running multiple Telegraf instances?

exabrial · October 3, 2018, 2:17pm

I would think about SPOF (single points of failure). Having one Telegraf instance polling a lot of equipment might create one. If the equipment has native SNMP capabilities, maybe pushing data out of the equipment into a pool of Telegraf instances could be a better solution? Thinking about failure modes is difficult because you’re dealing with a lot of unknowns, but hopefully this helps.

daniel · October 3, 2018, 8:36pm

There are advantages to defining multiple snmp plugins instead of one plugin with multiple agents: doing so prevents a slow agent from blocking collection from the other agents:

[[inputs.snmp]]
  # snip
[[inputs.snmp]]
  # snip
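
To make the layout above more concrete, here is a minimal sketch of two separate snmp plugin instances, each with its own agent. The hostnames, community string, timeout values, and OIDs are placeholders chosen for illustration and are not from the original post; adjust them to your devices.

```toml
# Sketch only: hostnames, community string, and OIDs are assumed placeholders.

# First plugin instance: polls one device.
[[inputs.snmp]]
  agents = ["switch-a.example.net:161"]
  version = 2
  community = "public"
  timeout = "5s"
  retries = 1

  [[inputs.snmp.field]]
    name = "uptime"
    oid = ".1.3.6.1.2.1.1.3.0"   # sysUpTime.0

# Second plugin instance: gathered separately from the first,
# so a slow agent here does not hold up the plugin above.
[[inputs.snmp]]
  agents = ["router-b.example.net:161"]
  version = 2
  community = "public"
  timeout = "5s"
  retries = 1

  [[inputs.snmp.field]]
    name = "uptime"
    oid = ".1.3.6.1.2.1.1.3.0"   # sysUpTime.0
```

Keeping the timeout and retries low per plugin limits how long any single unresponsive device can occupy its own collection.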

Multiple processes are mostly useful only once you reach the limit of how fast you can send to the outputs. You may also need multiple Telegraf instances on separate systems if you saturate the network connection or the CPU.

When it comes to high availability, I don’t think the type of architecture that @exabrial suggests is possible with SNMP, because SNMP collection is polling-based. You could poll each device from two collectors, but that may introduce a lot of extra load, so I would probably avoid it.

Touchedegris · October 4, 2018, 12:46am

Thanks! HA / SPOF is not a concern in my case, so I am not worried about it.
I started using the “--config-directory telegraf.d” option, where I create a unique SNMP input per host, so it kind of aligns with your recommendation.
Thanks for your recommendations. The next step, I guess, is to figure out how many inputs I can configure for a single Telegraf instance and how many measurements I can send from it.
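
As an illustration of the one-input-per-host layout described above, a file in the config directory might look roughly like the sketch below. The file name, hostname, community string, interval, and OIDs are assumptions made up for this example, and the inputs.internal section is an optional addition, not something mentioned in the thread.

```toml
# telegraf.d/switch01.conf (hypothetical file name): one SNMP input for one host.
# Hostname, community string, interval, and OIDs are assumed placeholders.
[[inputs.snmp]]
  agents = ["switch01.example.net:161"]
  version = 2
  community = "public"
  interval = "60s"   # per-plugin collection interval
  timeout = "5s"
  retries = 1

  # Tag every metric from this input with the device's sysName.
  [[inputs.snmp.field]]
    name = "hostname"
    oid = ".1.3.6.1.2.1.1.5.0"   # sysName.0
    is_tag = true

  [[inputs.snmp.field]]
    name = "uptime"
    oid = ".1.3.6.1.2.1.1.3.0"   # sysUpTime.0

# Optional: Telegraf's own statistics, which may help when judging
# how many inputs a single instance can handle.
[[inputs.internal]]
```

Telegraf loads the .conf files found in the directory passed with --config-directory, so adding a device means adding one more file of this shape.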
