Collection took longer than expected; not complete after interval of 10s

steph-ipi-69 · August 16, 2021, 2:12pm

Hello,

I want to use the TIG stack (Telegraf, InfluxDB, Grafana) with a Squid proxy server. I am following this tutorial:
GitHub - molu8bits/squid-grafana-monitoring: Monitor Squid Proxy Server using SNMP, Telegraf, Influxdb and view graphs as Grafana dashboard.

I get errors with Telegraf:
août 16 15:57:10 l-adm-proxy-01r telegraf[70560]: 2021-08-16T13:57:10Z W! [inputs.snmp] Collection took longer than expected; not complete after interval of 10s
août 16 15:57:20 l-adm-proxy-01r telegraf[70560]: 2021-08-16T13:57:20Z W! [inputs.snmp] Collection took longer than expected; not complete after interval of 10s
août 16 15:57:30 l-adm-proxy-01r telegraf[70560]: 2021-08-16T13:57:30Z W! [inputs.snmp] Collection took longer than expected; not complete after interval of 10s
août 16 15:57:30 l-adm-proxy-01r telegraf[70560]: 2021-08-16T13:57:30Z E! [inputs.snmp] Error in plugin: agent myipserver:3401: performing get on field uptime: request timeout (after 3 retries)

But I don't know exactly how to fix this. I opened port 3401 for UDP, and for TCP as well (for testing), but it did not help,
so I don't understand what I need to fix here.

github.com/influxdata/telegraf

SNMP plugin timeout when the response SRC IP is different than original request (HA Virtual IPs)

opened 05:25PM - 10 Oct 17 UTC

closed 02:02PM - 07 May 21 UTC

derekmwright

bug area/snmp upstream

## Bug report

### Relevant telegraf.conf:
```
[[inputs.snmp]]
  agents =… [ "remote-hostname" ]
  community = "public"
  name = "system"

  [[inputs.snmp.field]]
    name = "hostname"
    oid = "1.3.6.1.2.1.1.5.0"
    is_tag = true
```

### System info:
Ran from Docker Hub (library)
Telegraf v1.4.1 (git: release-1.4 2de7aa23d7d3c3bcf639d417129ab7c17d83399b)

### Steps to reproduce:
1. Ensure Remote-Hostname is listening on 161 for SNMP GET
2. Execute container with snmp config mounted: `docker run -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro telegraf`
3. Receive error: `2017-10-10T17:18:31Z E! Error in plugin [inputs.snmp]: agent remote-hostname: performing get on field hostname: Request timeout (after 3 retries)`

### Expected behavior:
Telegraf should be able to extract the response from the remote host when using OID numbers.

### Actual behavior:
Telegraf doesn't see the response from the remote host as valid and then retries the query.

### Additional info:
```
13:18:30.287374 IP (tos 0x0, ttl 63, id 42471, offset 0, flags [DF], proto UDP (17), length 71)
    1.2.3.4.36803 > 1.2.3.5.161: { SNMPv2c { GetRequest(28) R=-1297462436 .1.3.6.1.2.1.1.5.0 } }
    0x0000: 4500 0047 a5e7 4000 3f11 3882 0a50 0be1 E..G..@.?.8..P..
    0x0010: 0a02 3d0a 8fc3 00a1 0033 5d81 3029 0201 ..=......3].0)..
    0x0020: 0104 0670 7562 6c69 63a0 1c02 04b2 aa4b ...public......K
    0x0030: 5c02 0100 0201 0030 0e30 0c06 082b 0601 \......0.0...+..
    0x0040: 0201 0105 0005 00 .......
13:18:30.288262 IP (tos 0x0, ttl 62, id 63659, offset 0, flags [none], proto UDP (17), length 83)
    1.2.3.5.161 > 1.2.3.4.36803: { SNMPv2c { GetResponse(40) R=-1297462436 .1.3.6.1.2.1.1.5.0="remote-hostname" } }
    0x0000: 4500 0053 f8ab 0000 3e11 26b1 0a02 3d0b E..S....>.&...=.
    0x0010: 0a50 0be1 00a1 8fc3 003f bb81 3035 0201 .P.......?..05..
    0x0020: 0104 0670 7562 6c69 63a2 2802 04b2 aa4b ...public.(....K
    0x0030: 5c02 0100 0201 0030 1a30 1806 082b 0601 \......0.0...+..
    0x0040: .... .... .... .... .... .... .... .... .......REMOTE-HOSTNAME
    0x0050: 312d 41
```
(I removed the "actual" hostname from this capture)

I use a virtual machine running Red Hat 8, not Docker.

Can you help me, please?
Thanks

Anaisdg · August 16, 2021, 6:30pm

Hello @steph-ipi-69,
You may want to experiment with flush_jitter = "5s" or metric_batch_size = 5000. Change one setting at a time and measure before and after to judge the impact.
How many metrics are you trying to collect and write?
@popey, do you have any suggestions here as well?

steph-ipi-69 · August 16, 2021, 6:44pm

Hello @Anaisdg,

OK, thanks for the reply.
I put these lines in telegraf.conf, right?
I will test tomorrow.

For the moment I want to collect around 15 metrics…

Thanks a lot

Anaisdg · August 16, 2021, 6:51pm

Hello @steph-ipi-69,
They live in the [agent] section of your config. You might have to increase them, although it sounds like you won't need to increase the batch size if you're only trying to collect around 15 metrics.
15 metrics at what collection interval?

steph-ipi-69 · August 16, 2021, 7:29pm

I don't remember having an interval in telegraf.conf.

This is my telegraf.conf:

[outputs]
[outputs.influxdb]
    url = "http://localhost:8086"
    database = "telegraf"

[[inputs.snmp]]
  agents = [ "YOUR_SQUID_IP_ADDRESS:3401" ]
  version = 2
  community = "public"
  name = "snmpsquid"

 [[inputs.snmp.field]]
    name = "uptime"
    oid = "1.3.6.1.4.1.3495.1.1.3.0"

 [[inputs.snmp.field]]
    name = "cacheVersionId"
    oid = "1.3.6.1.4.1.3495.1.2.3.0"

 [[inputs.snmp.field]]
    name = "cacheMemMaxSize"
    oid = "1.3.6.1.4.1.3495.1.2.5.1.0"
 [[inputs.snmp.field]]
    name = "cacheMemUsage"
    oid = "1.3.6.1.4.1.3495.1.3.1.3.0"

 [[inputs.snmp.field]]
    name = "cacheCpuUsage"
    oid = "1.3.6.1.4.1.3495.1.3.1.5.0"

 [[inputs.snmp.field]]
    name = "cacheSwapMaxSize"
    oid = "1.3.6.1.4.1.3495.1.2.5.2.0"
 [[inputs.snmp.field]]
    name = "cacheCurrentSwapSize"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.14.0"

 [[inputs.snmp.field]]
    name = "cacheClients"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.15.0"

 [[inputs.snmp.field]]
    name = "cacheProtoClientHttpRequests"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.1.0"

 [[inputs.snmp.field]]
    name = "cacheHttpHits"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.2.0"

 [[inputs.snmp.field]]
    name = "cacheHttpErrors"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.3.0"

 [[inputs.snmp.field]]
    name = "cacheHttpInKb"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.4.0"

 [[inputs.snmp.field]]
    name = "cacheHttpOutKb"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.5.0"

 [[inputs.snmp.field]]
    name = "cacheServerInKb"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.12.0"

 [[inputs.snmp.field]]
    name = "cacheServerOutKb"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.13.0"

steph-ipi-69 · August 16, 2021, 7:32pm

Perhaps in Grafana?

{
  "cacheTimeout": null,
  "colorBackground": true,
  "colorValue": false,
  "colors": [
    "#299c46",
    "rgba(237, 129, 40, 0.89)",
    "#d44a3a"
  ],
  "datasource": "${DS_TELEGRAF}",
  "format": "s",
  "gauge": {
    "maxValue": 100,
    "minValue": 0,
    "show": false,
    "thresholdLabels": false,
    "thresholdMarkers": true

steph-ipi-69 · August 17, 2021, 9:15am

Hello,

I changed the Telegraf config file: I added my input and output values to the default config file.
I tested and got the same "Collection took longer than expected" message.
I followed your advice and used your example value (5s), and got the same message, but for 5s instead of 10s :'(
So I changed it to allow more time…

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "60s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 5000

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "5s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "5s"

And now this is the new message…

août 17 11:35:40 l-adm-proxy-01r telegraf[78167]: 2021-08-17T09:35:40Z E! [inputs.snmp] Error in plugin: agent myip:3401: performing get on field uptime: request timeout (after 3 retries)

Giovanni_Luisotto · August 17, 2021, 4:16pm

The interval setting also defines the time window for the whole data-gathering process (as you might have noticed).
Now that you have set it to 60s, you are one step closer to the root issue, as you get a client-side timeout error.

Do you get a timeout if you try to get the data "manually" from the same machine used by Telegraf?

As for what the issue might be, I have no idea, since network errors can have a huge number of different causes.
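
One way the manual check suggested above could be run, sketched in Python: it builds a call to the Net-SNMP `snmpget` tool (assumed to be installed on the Telegraf host) for the same `uptime` OID that Telegraf times out on. The helper names, the 127.0.0.1 address and the timeout and retry values are illustrative assumptions; the port, community and OID are taken from the telegraf.conf posted earlier in this thread.

```python
import subprocess


def build_snmpget_cmd(host, port, community, oid, timeout_s=2, retries=1):
    """Build a Net-SNMP snmpget command line for an SNMPv2c GET."""
    return [
        "snmpget",
        "-v2c",                # SNMP version 2c, as in the telegraf.conf above
        "-c", community,       # community string
        "-t", str(timeout_s),  # per-request timeout in seconds
        "-r", str(retries),    # number of retries
        f"{host}:{port}",      # agent address; 3401 is the Squid SNMP port used here
        oid,
    ]


def run_snmpget(cmd):
    """Run the command and return (exit_code, combined stdout and stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, (result.stdout + result.stderr).strip()


# Hypothetical target: replace 127.0.0.1 with the address of the Squid host.
cmd = build_snmpget_cmd("127.0.0.1", 3401, "public", "1.3.6.1.4.1.3495.1.1.3.0")
print(" ".join(cmd))
# To actually send the request: code, output = run_snmpget(cmd)
```

If the same request also times out when sent by hand, that would suggest the problem lies on the Squid side (SNMP port, ACLs) or in the network rather than in Telegraf.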

steph-ipi-69 · August 17, 2021, 4:30pm

Hello, thanks for the reply.

I haven't tested manually…
And I don't know how to test.

With snmpwalk?
Thanks

steph-ipi-69 · August 18, 2021, 7:44am

Hello everyone,

Shame on me:
my Squid conf was wrong.
I had written acl snm_pnet src 127.0.0.1 instead of acl snmpnet src 127.0.0.1
Thanks Giovanni_Luisotto for your help, I hadn't thought of verifying manually.

vkhemani · July 14, 2022, 4:16pm

@Anaisdg
Hello, I tried to read data at 1s and 1min intervals, and I see that this error appears every second or every minute. In InfluxDB, I see the data every 2s or every 2min. When I change the interval to 2s or 30s, there is no problem. What could be the issue with 1s or 1min?

Anaisdg · July 18, 2022, 5:06pm

@vkhemani sorry wait, you changed what to 1s and 1min?

vkhemani · July 19, 2022, 1:18am

@Anaisdg, I have set interval=1s in the global settings of Telegraf to read data every 1s from an OPC server. But Telegraf is reading data every 2s instead of 1s. I checked the log and I see that, every 1s, there is an error "collection took longer than expected…".

Same error at interval=1min: Telegraf is reading data every 2min instead of 1min.

Thanks for your help

jpowers · July 20, 2022, 9:14pm

Hi @vkhemani,

The "collection took longer than expected…" message means that the time it took for Telegraf to start collecting, send the request to the OPC UA server over the network, have the OPC UA server handle the request, respond to the request over the network, and then have Telegraf process the response was longer than the "1s" interval that you set. As a result, the next interval is skipped, which is why you see results every 2 seconds.
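
The skipped-interval behaviour described above can be illustrated with a small simulation. This is a simplified model in Python, not Telegraf's actual scheduler: it assumes a fixed collection duration and that any tick arriving while a collection is still running is dropped. The function name and the millisecond values are made up for illustration.

```python
def collection_starts(interval_ms, duration_ms, horizon_ms):
    """Return the start times (ms) of the collections that actually run.

    A tick fires every interval_ms. If the previous collection is still
    running when a tick fires, that tick is skipped (simplified model).
    """
    starts = []
    busy_until = 0  # time at which the running collection finishes
    for tick in range(0, horizon_ms, interval_ms):
        if tick >= busy_until:
            starts.append(tick)
            busy_until = tick + duration_ms
    return starts


# A 1s interval with a collection that takes 1.2s: every other tick is skipped.
slow = collection_starts(1000, 1200, 6000)
# The same interval with a collection that takes 0.8s: no tick is skipped.
fast = collection_starts(1000, 800, 6000)
print(slow)
print(fast)
```

Under these assumptions the slow case starts collections at 0, 2000 and 4000 ms, that is, one sample every 2 seconds despite the 1s interval, while the fast case collects on every tick.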

With an interval setting of 1s, you are expecting all the operations I listed above to happen within a second. This is entirely possible with many Telegraf plugins. However, based on some previous posts in this forum here and here and issue reports with OPC UA, it does seem that a 1s interval when interacting with OPC UA can be too small.

This is entirely dependent on the server you are connecting to, the network you are using to connect to it, and the amount of data you are querying at a given time. The first forum link above shows one possible workaround: splitting up your opcua Telegraf config.

If you wanted someone to look at your config or Telegraf logs in more detail, I would suggest opening a new topic.

Hope that helps!

vkhemani · July 22, 2022, 4:13am

@jpowers Thanks for the reply. I went through the links you sent. I suppose the way out is to split up the OPC UA config. I will test.

However, I have faced the same problem with a 1min interval for a Modbus config, and there are very few data points. In this case, I get data every 2min. If I change the interval from 1min to 30s, it works fine and I get data every 30s.

I also tested with a 500ms interval in the case of OPC UA. In this case, I could read data in 1-1.5s!

For some applications, it is important to read data at millisecond intervals. Hence, I hope I can achieve this using OPC UA + Telegraf + InfluxDB.
