Hello team:
I use Telegraf to poll 150 network devices every 300 seconds and push the collected data to InfluxDB. I use one configuration file per device, with the SNMP (CPU, RAM, uptime, interfaces) and ping input plugins.
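For illustration, one of those per-device files looks roughly like the sketch below; the address, community string, and OIDs are placeholders rather than my exact values:

```toml
# Sketch of one per-device file (placeholder values).
[[inputs.snmp]]
  agents = ["udp://192.0.2.10:161"]   # device address (placeholder)
  version = 2
  community = "public"                # placeholder community string
  timeout = "5s"
  retries = 3

  # Scalar values such as uptime (CPU/RAM OIDs are vendor-specific and omitted here)
  [[inputs.snmp.field]]
    name = "uptime"
    oid = "1.3.6.1.2.1.1.3.0"         # sysUpTime

  # Per-interface counters from the standard interfaces table
  [[inputs.snmp.table]]
    name = "interface"
    oid = "1.3.6.1.2.1.2.2"           # ifTable
    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "1.3.6.1.2.1.2.2.1.2"
      is_tag = true

[[inputs.ping]]
  urls = ["192.0.2.10"]               # same device (placeholder)
  count = 3
```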
I was asked if I could extend the same polling to 800 more devices.
I do not feel confident. Will a single Telegraf container be able to sustain almost 1,000 devices every 5 minutes? If Telegraf takes on average 1 second per device to complete the whole cycle of collection and storage in InfluxDB, and devices are polled sequentially, then 1,000 devices would need roughly 1,000 seconds, which does not fit in a 300-second interval, so the request could not be fulfilled.
Could you please point me, if available, to documentation that discusses the scalability of Telegraf?
Any hints will be greatly appreciated.
Best regards
Scalability and performance depend on many things: the host running Telegraf (its CPU, memory, and storage capacity and speed), how busy your network is, the devices you are querying, and even the Telegraf settings, such as the interval settings. Because of all these variables there are no hard recommendations.
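For example, the agent-level settings below are usually the first things to look at when scaling up; the values shown are illustrative, not recommendations:

```toml
# Illustrative [agent] settings for a large device count (tune to your environment).
[agent]
  interval = "300s"             # collection interval for all inputs
  collection_jitter = "30s"     # spread collections so all devices aren't polled at the same instant
  metric_batch_size = 1000      # metrics sent per write to the output
  metric_buffer_limit = 100000  # metrics buffered in memory if the output falls behind
  flush_interval = "10s"
  flush_jitter = "5s"
```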
That said, if you haven’t seen the SNMP best practices blog post, please give it a read. The author successfully polls thousands of SNMP devices.
> I was asked if I could extend the same polling to 800 more devices.
My suggestion would be to read the above blog post and then start increasing the number of devices, ramping up toward your 1,000. At worst you might run into an issue that requires you to split your configuration between two Telegraf instances, or across two hosts.
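If it does come to that, the split itself is straightforward: each Telegraf instance keeps the same [agent] and output sections and simply loads the per-device input files for its share of the fleet. A minimal sketch of the part both instances would share, assuming an InfluxDB 2.x output and placeholder URL, organization, bucket, and token:

```toml
# Common portion for each Telegraf instance; only the set of per-device
# [[inputs.snmp]] / [[inputs.ping]] files loaded alongside it differs.
[agent]
  interval = "300s"

[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]  # placeholder URL
  token = "$INFLUX_TOKEN"          # read from an environment variable (placeholder)
  organization = "my-org"          # placeholder
  bucket = "network"               # placeholder
```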