I’ve been using Telegraf in our lab to collect SNMP data from some Nexus switches, UCS Fabric Interconnects, and MDS SAN switches, and it worked great. After deploying to production, the queries that took less than a second in the lab are now taking 45+ seconds. The production Nexus 7Ks have hundreds of interfaces, and I’m worried that the large list of switches I’m providing to Telegraf is being processed serially. I’m just pulling the interface stats via IF-MIB, nothing fancy. I’m running Telegraf in a k8s container deployed via Helm.
Before I look at other SNMP collectors or try to create my own Helm chart for Telegraf that spins up a separate single-instance container for each SNMP server being monitored, I figured I’d ask here to see what folks are doing at scale. Is there a magical parallel setting I’m missing, or a magical k8s autoscaling setup that I’m ignorant of?