From today’s training session: what is the performance of Telegraf's SNMP collection bound by? How many SNMP objects can it collect and store in InfluxDB in a given time?
In Telegraf 1.4, collection time increases linearly with the number of remote agents and the number of fields collected. In the upcoming 1.5 release, remote agents will be collected concurrently, so collection time should grow with the number of fields collected only.
The number of objects that can be collected primarily depends on the speed of your snmp devices and network.
Hi Daniel, was concurrent collection of remote agents released for the snmp input plugin?
Yes, this is included in 1.5 and newer.
Thank you Daniel for confirming. So just to be clear, if we configure 3000 devices then there will be that many concurrent threads?
I would be interested to know how to size the VM for memory with this in mind. I read an old thread with your note on metric_buffer_limit, however it did not mention how this increases with an increase in remote agents.
There would be 3000 goroutines mapped onto a thread pool by the Go runtime. Each goroutine would need to allocate temporary space for receiving from its SNMP agent. The best way to size is empirically: start small and then double the number of agents and observe the change. You can use the internal input to watch Telegraf’s memory usage.
The metric_buffer_limit sets the upper limit for metric memory storage during failures on a per output plugin basis, so multiply it by the number of outputs if you will use more than one.
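For anyone running this kind of sizing test, here is a minimal sketch of the relevant settings. It assumes the stock buffer limit of 10000 metrics and a single output; adjust both to your setup. With collect_memstats enabled, the internal input should report Go runtime memory statistics (in a measurement named internal_memstats, as far as I know), which you can graph while doubling the agent count.

```toml
[agent]
  interval = "60s"
  # Upper bound on metrics buffered per output while that output is failing.
  # Assumed value; with N outputs the worst case is roughly N times this.
  metric_buffer_limit = 10000

# Self-monitoring: reports Telegraf's own memory and per-plugin statistics.
[[inputs.internal]]
  collect_memstats = true
```

Take one reading per agent count (for example 250, 500, 1000) and look at how memory changes between steps rather than at any single value.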
Ok great. Thanks again. I will run my tests.
Do we know the size of the thread pool? I am also assuming that the Go runtime is configured to use multiple logical processors?
I believe it is based on the number of processors on your system, and yes, Go will use multiple processors. The goroutines are scheduled to run only when they have work available, and don’t consume a thread when they are blocked on network calls.
I could see a huge difference in SNMP performance depending on the CPU used. For example, a VM with 2 i7 cores could easily complete a full SNMP routine with a 10s polling interval, while a Raspberry Pi 3 was barely able to complete the routine with a one-minute polling interval (both running CentOS 7 with Docker containers).
Is there a way to use the --test option and get the time it took for the routine to complete? It would help to find the best polling interval to set for an instance.
The closest thing to this would be using the internal input; it will produce metrics with the gather time for each plugin.
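As a rough sketch of how that could look, the fragment below keeps only the per-plugin gather statistics. It assumes the measurement is named internal_gather, is tagged with the input name, and carries a gather_time_ns field; check the output on your version before relying on those names.

```toml
# Keep only the per-plugin gather statistics from the internal input.
[[inputs.internal]]
  collect_memstats = false
  # Assumed measurement name; namepass drops every other internal_* metric.
  namepass = ["internal_gather"]
```

Comparing the reported gather time for the snmp input against the configured interval shows how much headroom a given polling interval leaves.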
On a somewhat related note, I am also using the snmp input on several thousand devices (I have multiple containers running the collections). I would like to add additional tags to each agent. To achieve this, my understanding is that I would need a separate input section for each agent I am collecting from. Will this have significant performance implications compared to passing in a large ‘agent’ list? Or does Telegraf optimize that somehow?
There are some tradeoffs, and I expect it will use more memory when split out into separate configs. However, I don’t know the exact details. If you do make this transformation, it would be nice if you could enable the internal input and take some measurements before and after the change.
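For reference, a minimal sketch of the 1:1 agent/input layout being discussed. The host names, port, tag keys, and the single sysUpTime field are placeholders for illustration, not taken from a real setup.

```toml
# One snmp input per device, each with its own static tags.
[[inputs.snmp]]
  agents = ["agent1:161"]

  [[inputs.snmp.field]]
    name = "uptime"
    oid = ".1.3.6.1.2.1.1.3.0"  # sysUpTime.0

  # Keep the tags table last in the plugin definition.
  [inputs.snmp.tags]
    newkey1 = "newtag1"
    newkey2 = "newtag2"

[[inputs.snmp]]
  agents = ["agent2:161"]

  [[inputs.snmp.field]]
    name = "uptime"
    oid = ".1.3.6.1.2.1.1.3.0"

  [inputs.snmp.tags]
    newkey1 = "newtag3"
    newkey2 = "newtag4"
```

With several thousand devices a file like this would normally be generated from an inventory rather than written by hand.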
One other thing to consider is that, depending on what sort of tags you are adding, it might be feasible to use a processor for tagging instead.
Thanks @daniel. Started to look into the out of box processors. Is it possible somehow to do something like below:
[inputs.snmp.tagpass]
  agent_host = ["agent1"]
[inputs.snmp.tags]
  newkey1 = newtag1
  newkey2 = newtag2
  agent_host = ["agent2"]
[inputs.snmp.tags]
  newkey1 = newtag3
  newkey2 = newtag4
This seems to work if I have just one ‘agent_host’ section. Is there a way to chain these? (The documentation mentions that "Excluded metrics are passed downstream to the next processor.")
If not, I will go back and do some testing with a 1:1 agent/input section (and turn on the internal input).
You could do this, but if you need to define many of these processors then I think you will be better off with the 1:1 agent/input for performance reasons.
[[processors.override]]
  namepass = ["snmp"]
  [processors.override.tagpass]
    agent_host = ["agent1"]
  [processors.override.tags]
    newkey1 = "newtag1"
    newkey2 = "newtag2"

[[processors.override]]
  namepass = ["snmp"]
  [processors.override.tagpass]
    agent_host = ["agent2"]
  [processors.override.tags]
    newkey1 = "newtag3"
    newkey2 = "newtag4"