[inputs.snmp] Error in plugin: agent udp://192.168.20.103:161: performing get on field SNMPv2-MIB::sysName: error reading from socket: read udp 192.168.2.13:60574->192.168.20.103:161: recvfrom: connection refused
I can’t figure out what is happening.
Sometimes, from that moment on, the agent cannot be reached (even with the snmpget command) until the telegraf service is restarted.
After a Telegraf restart, the snmpget command works fine and no error is logged until it happens again.
Telegraf 1.32.0 (the problem occurred also with 1.31.2)
OS: “SUSE Linux Enterprise Server 12 SP5”
I have more than 1000 metrics, each with its own config file more or less like this one:
Well, as the message says, Telegraf cannot establish a UDP “connection” to the given host/port! Packet filter? Typo in the IP or port? Authentication missing or wrong?
If I check netstat while Telegraf is running, I get one of these lines for each inputs.snmp plugin:
2024-09-23T11:14:33Z I! Starting Telegraf 1.32.0 brought to you by InfluxData the makers of InfluxDB
2024-09-23T11:14:33Z I! Available plugins: 235 inputs, 9 aggregators, 32 processors, 26 parsers, 62 outputs, 6 secret-stores
2024-09-23T11:14:33Z I! Loaded inputs: snmp (4967x)
udp 0 0 ECM_ETH0_IP:64996 192.168.3.101:snmp ESTABLISHED
udp 0 0 ECM_ETH0_IP:64997 192.168.3.111:snmp ESTABLISHED
udp 0 0 ECM_ETH0_IP:64998 192.168.1.108:snmp ESTABLISHED
udp 0 0 ECM_ETH0_IP:64999 192.168.4.108:snmp ESTABLISHED
udp 0 0 ECM_ETH0_IP:65000 192.168.5.105:snmp ESTABLISHED
ARBBCWP_GDE_DSS01:/var/log/telegraf # netstat | grep snmp | wc -l
4967
It looks like the limit on my OS is 5000, because I get the problem when I configure more than that number of “inputs.snmp” plugins.
I guess that when running with the “--once” option all the connections are set up at the same time (ignoring interval and collection_jitter), so that’s why I reach the limit immediately.
Running without that option, “collection_jitter” reduces the problem.
I’ll set “collection_jitter” to “2h” and check tonight what happens.
Another alternative is to find out where this 5000 limit is set and increase it, but I am not the owner of the OS, so that would be more complicated.
At midnight, when all the “24h” plugins start, they use all the ports available (5001) and the next plugin has no port.
I have already gotten the OK to increase net.ipv4.ip_local_port_range (it seems that it has little or no impact), but I’ll try one more test using the offset value to split the start times of the “24h” plugins, so that some finish before the rest start and the count stays under 5000 active plugins.
It seems that, although the ports are freed after use, Telegraf starts all plugins at once without considering the “collection_offset” setting, so at 00:00, when the “24h” interval starts, one port per plugin is needed.
I set net.ipv4.ip_local_port_range = 1024 65535, so now there are plenty of ports available, and with unconnected_udp_socket = true they should be freed after use.
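For reference, a minimal sketch of where unconnected_udp_socket sits in an inputs.snmp block. The agent address and field name come from the error message at the top of the thread; the SNMP version, the community string, and the ".0" instance suffix on the OID are assumptions for illustration, not the poster's actual configuration. The claim that ports are freed after use is the poster's observation, not something this fragment demonstrates.

```toml
[[inputs.snmp]]
  # Agent address taken from the error message in this thread
  agents = ["udp://192.168.20.103:161"]
  version = 2             # assumption: SNMP v2c
  community = "public"    # assumption: placeholder community string
  interval = "24h"

  # With an unconnected UDP socket, responses are accepted from any
  # address, not only the one the request was sent to.
  unconnected_udp_socket = true

  [[inputs.snmp.field]]
    name = "sysName"
    oid = "SNMPv2-MIB::sysName.0"   # assumption: ".0" scalar instance suffix
```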
The other thread clarified the setting, and it worked as far as releasing the ports goes.
But at midnight, when all the “24h” plugins start, they use all the ports available, or so it seems.
After the run the ports are released, but from 00:00 on I get the error.
That’s probably because all plugins start at the same time. You can either set the collection_jitter setting in the agent section to randomize the collection timing or set the collection_offset setting per input plugin to have full control over the timing of the different plugins.
I did that last week, but I set both settings collection_jitter and collection_offset at plugin level.
Is that wrong?
I will try again and see what happens.
Both can be set on a per-plugin level, but I would pick one or the other. So either set collection_jitter in the agent section to randomize the point in time at which gather is called for the different plugins, OR set different collection_offsets on the plugins to deterministically order their execution.
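The two alternatives described above could look roughly like the sketch below. The agent addresses are taken from the netstat output earlier in the thread and the “2h” jitter from the earlier post; the offset values, community string, and OID are hypothetical and would need to be chosen so that fewer plugins run concurrently than there are local ports.

```toml
# Option A: randomize collection times for all plugins via the agent section.
# Leave this commented out when using Option B.
# [agent]
#   collection_jitter = "2h"

# Option B: stagger plugins deterministically with per-plugin offsets.
[[inputs.snmp]]
  agents = ["udp://192.168.3.101:161"]
  community = "public"        # assumption: placeholder
  interval = "24h"
  collection_offset = "0s"    # hypothetical: first batch, at the interval start

  [[inputs.snmp.field]]
    name = "sysName"
    oid = "SNMPv2-MIB::sysName.0"   # assumption

[[inputs.snmp]]
  agents = ["udp://192.168.3.111:161"]
  community = "public"        # assumption: placeholder
  interval = "24h"
  collection_offset = "30m"   # hypothetical: second batch, 30 minutes later

  [[inputs.snmp.field]]
    name = "sysName"
    oid = "SNMPv2-MIB::sysName.0"   # assumption
```

With more than 1000 config files, the offsets would in practice be assigned in batches rather than one distinct value per plugin.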