SNMP plugin - error reading from socket

PabloFraire · September 19, 2024, 8:53pm

Hi!

I am having this error:

[inputs.snmp] Error in plugin: agent udp://192.168.20.103:161: performing get on field SNMPv2-MIB::sysName: error reading from socket: read udp 192.168.2.13:60574->192.168.20.103:161: recvfrom: connection refused

And I couldn’t figure out what is happening.

Sometimes, from that moment on, the agent cannot be reached (even with the snmpget command) until the telegraf service is restarted.

After the telegraf restart, the snmpget command works fine and no error is logged until it happens again.

Telegraf 1.32.0 (the problem occurred also with 1.31.2)

OS: “SUSE Linux Enterprise Server 12 SP5”

I have more than 1000 metrics, each with its own config file more or less like this one:

[[inputs.snmp]]
  agents = ["udp://192.168.20.103:161"]

  timeout = "15s"
  version=3

  ## SNMPv3 authentication and encryption options.
  ##

  retries = 3
  max_repetitions = 10
  interval = "24h"
  collection_jitter = "1h"

  agent_host_tag = "source"
  tagexclude = ["host"]

  name = "SNMPv2-MIB::system"

  [[inputs.snmp.field]]
    oid = ".1.3.6.1.2.1.1.1.0"
    name = "SNMPv2-MIB::sysDescr"
    is_tag = true

etc...

Since I have a 1-day collection interval, I set the collection jitter to 1 hour to reduce simultaneous collections.

Any clue?

Regards!

srebhan · September 20, 2024, 10:22am

Well, as the message says, Telegraf cannot establish a UDP “connection” to the given host/port! Packet filter? Typo in IP or port? Authentication missing or wrong?

PabloFraire · September 20, 2024, 11:06am

Hi @srebhan !

The connection works perfectly until it fails.

I ran it with the --test option and it works fine; here is a partial output:

> SNMPv2-MIB::system,SNMPv2-MIB::sysContact=Backend\ de\ Aseguramiento,SNMPv2-MIB::sysDescr=Hardware\ management\ system,SNMPv2-MIB::sysLocation=ARBBCWP\ EL\ TALAR\ RUTA\ PANAMERICANA\ 32750,SNMPv2-MIB::sysName=ARBBCWP_GDE_HAD99-LOM,SNMPv2-MIB::sysObjectID=hwServer,source=192.168.20.103 IF-MIB::ifNumber=6i,SNMPv2-MIB::sysUpTime=18.55 1726828958000000000
> HUAWEI-SERVER-IBMC-MIB::temperatureDescriptionTable,SNMPv2-MIB::sysName=ARBBCWP_GDE_HAD99-LOM,index=1,source=192.168.20.103 HUAWEI-SERVER-IBMC-MIB::temperatureLowerCritical=65535i

It fails at random, and the block seems to happen after some error during connection or disconnection.

I have 45 measurements for each device, with each device and measurement in its own config file.

Running telegraf with only this server’s configuration files, isolated in a test directory, did not fail.

Any way to log more detailed info, to debug in production?

Regards!

PabloFraire · September 20, 2024, 1:33pm

I reproduced the problem, so I ran these tests:

ARBBCWP_GDE_DSS01:/home/paas/telegraf/test # ping 192.168.20.103
connect: Resource temporarily unavailable
ARBBCWP_GDE_DSS01:/home/paas/telegraf/test # snmpget -v3 -Oa -l authPriv -u  -a SHA -A "" -x AES -X "" 192.168.20.103 SNMPv2-MIB::sysName.0
snmpget: Unknown engine ID (Resource temporarily unavailable)
ARBBCWP_GDE_DSS01:/home/paas/telegraf/test # kill -KILL %1
ARBBCWP_GDE_DSS01:/home/paas/telegraf/test #
[1]+  Killed                  /usr/bin/telegraf -config /home/paas/telegraf/conf/telegraf.conf -config-directory /home/paas/telegraf/test --once
ARBBCWP_GDE_DSS01:/home/paas/telegraf/test #  ping 192.168.20.103
PING 192.168.20.103 (192.168.20.103) 56(84) bytes of data.
64 bytes from 192.168.20.103: icmp_seq=1 ttl=64 time=0.301 ms
^C
--- 192.168.20.103 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.301/0.301/0.301/0.000 ms
ARBBCWP_GDE_DSS01:/home/paas/telegraf/test # snmpget -v3 -Oa -l authPriv -u  -a SHA -A "" -x AES -X "" 192.168.20.103 SNMPv2-MIB::sysName.0
SNMPv2-MIB::sysName.0 = STRING: ARBBCWP_GDE_HAD99-LOM
ARBBCWP_GDE_DSS01:/home/paas/telegraf/test #

PabloFraire · September 20, 2024, 1:53pm

More tests:

Running like this:

/usr/bin/telegraf -config /home/paas/telegraf/conf/telegraf.conf -config-directory /home/paas/telegraf/test --once

I got the problem.

But running like this:

/usr/bin/telegraf -config /home/paas/telegraf/conf/telegraf.conf -config-directory /home/paas/telegraf/test --test

I got all the measurements without any problem.

Regards!

PabloFraire · September 23, 2024, 11:59am

Hi!

After many, many tests I got to this.

It looks like some limit in my OS configuration.

If I check netstat while telegraf is running, I get one of these lines for each inputs.snmp plugin:

2024-09-23T11:14:33Z I! Starting Telegraf 1.32.0 brought to you by InfluxData the makers of InfluxDB
2024-09-23T11:14:33Z I! Available plugins: 235 inputs, 9 aggregators, 32 processors, 26 parsers, 62 outputs, 6 secret-stores
2024-09-23T11:14:33Z I! Loaded inputs: snmp (4967x)

udp        0      0 ECM_ETH0_IP:64996       192.168.3.101:snmp      ESTABLISHED
udp        0      0 ECM_ETH0_IP:64997       192.168.3.111:snmp      ESTABLISHED
udp        0      0 ECM_ETH0_IP:64998       192.168.1.108:snmp      ESTABLISHED
udp        0      0 ECM_ETH0_IP:64999       192.168.4.108:snmp      ESTABLISHED
udp        0      0 ECM_ETH0_IP:65000       192.168.5.105:snmp      ESTABLISHED
ARBBCWP_GDE_DSS01:/var/log/telegraf # netstat | grep snmp | wc -l
4967

It looks like the limit in my OS is 5000, because when I configure more than that number of “inputs.snmp” plugins I get the problem.

I guess that when running with the “--once” option all the connections are opened at the same time (ignoring interval and collection_jitter), so that’s why I reach the limit immediately.

Running without that option, “collection_jitter” reduces the problem.

I’ll set “collection_jitter” to “2h” and check tonight what happens.

Another alternative is to find out where this 5000 limit is set and increase it, but I am not the owner of the OS, so that would be more complicated.

Regards!

PabloFraire · September 24, 2024, 10:23am

Hi!
@srebhan finally I found the root cause here:

Why does ping fail with the error “connect: Resource temporarily unavailable”?

This is my OS configuration:
net.ipv4.ip_local_port_range = 60000 65000

That leaves 5001 ports for outgoing connections (65000 - 60000 + 1).
My final configuration would have around 15000 inputs.snmp.

My question now is: is it expected that all the UDP ports stay taken for as long as telegraf is running?
Can’t they be released?

Should I open a different topic for this question?

Regards!

srebhan · October 2, 2024, 8:36am

Did you try to set unconnected_udp_socket = true?
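For reference, a minimal sketch of where that option would go, reusing the agent address and field from the first post. `unconnected_udp_socket` is a setting of the inputs.snmp plugin; the rest of the fragment is trimmed for illustration, and the SNMPv3 credentials are left out just as in the original config.

```toml
[[inputs.snmp]]
  agents = ["udp://192.168.20.103:161"]
  version = 3
  ## SNMPv3 authentication and encryption options omitted, as in the original post.

  interval = "24h"

  ## Use an unconnected UDP socket for this plugin instead of a connected one.
  unconnected_udp_socket = true

  name = "SNMPv2-MIB::system"

  [[inputs.snmp.field]]
    oid = ".1.3.6.1.2.1.1.1.0"
    name = "SNMPv2-MIB::sysDescr"
    is_tag = true
```

Since every device and measurement has its own config file here, the line would have to be added to each [[inputs.snmp]] block.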

PabloFraire · October 3, 2024, 10:01am

Hi @srebhan !

I set all my plugins with interval = “24h” with

unconnected_udp_socket = true

and it worked; this morning the ports are free.

But the problem is still there.

At midnight, when all the “24h” plugins start, they use all the ports available (5001) and the next plugin has no port.

I have already got the OK to increase net.ipv4.ip_local_port_range (it seems to have no or very low impact), but I’ll try one more test using the offset value to split the start time of the “24h” plugins, so that some finish before the rest start and the count stays under 5000 active plugins.

Regards!

PabloFraire · October 4, 2024, 9:29am

Well, it didn’t work.

It seems that although the ports are freed after use, Telegraf starts all plugins at once without considering the “collection_offset” setting; therefore at 00:00, when the “24h” interval starts, one port per plugin is needed.

I set net.ipv4.ip_local_port_range = 1024 65535, so now there are plenty of ports available, and with unconnected_udp_socket = true they should be freed after use.

Let’s see how it works.

Regards!

srebhan · October 7, 2024, 10:22am

I think this was solved in the other thread by setting unconnected_udp_socket = true.

PabloFraire · October 7, 2024, 12:23pm

Hi @srebhan !

The other thread clarified the setting, and it worked on the release side.

But at midnight, when all the “24h” plugins start, they use all the ports available, or so it seems.
After running, the ports are released, but from 00:00 I get the error.

I am still working on tuning these parameters:

net.ipv4.ip_local_port_range
LimitNOFILE

for running without errors my:

ping (6x) snmp (11021x)

Regards!

srebhan · October 7, 2024, 1:10pm

That’s probably because all plugins start at the same time. You can either set the collection_jitter setting in the agent section to randomize the collection timing or set the collection_offset setting per input plugin to have full control over the timing of the different plugins.

PabloFraire · October 8, 2024, 10:55am

I did that last week, but I set both settings collection_jitter and collection_offset at plugin level.
Is that wrong?
I will try again and see what happens.

srebhan · October 8, 2024, 12:40pm

Both can be set on a per-plugin level, but I would choose one or the other. So either set collection_jitter in the agent section to randomize the point in time at which gather is called for the different plugins, OR set different collection_offsets on plugins to deterministically order the execution of plugins.
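The two alternatives could look roughly like the sketch below. The jitter and offset values are illustrative only, and the agent addresses are simply taken from earlier posts in this thread; whether either variant keeps the number of simultaneously used ports under the limit would still need to be checked on the actual system.

```toml
## Option A: randomize the collection time of all plugins from the agent section.
[agent]
  collection_jitter = "1h"

## Option B (instead of A): shift each plugin by a fixed, different offset.
## Illustrative values; with thousands of plugins these would be spread in batches.
# [[inputs.snmp]]
#   agents = ["udp://192.168.20.103:161"]
#   interval = "24h"
#   collection_offset = "0s"
#
# [[inputs.snmp]]
#   agents = ["udp://192.168.3.101:161"]
#   interval = "24h"
#   collection_offset = "10m"
```

Option B is left commented out so that the fragment shows only one of the two approaches as active at a time.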
