Collection took longer than expected; not complete after interval of 10s

Hello,

I want to use TIG with SQUID proxy server. I am following this “tuto”
GitHub - molu8bits/squid-grafana-monitoring: Monitor Squid Proxy Server using SNMP, Telegraf, Influxdb and view graphs as Grafana dashboard.

I have errors wtih telegraf:
août 16 15:57:10 l-adm-proxy-01r telegraf[70560]: 2021-08-16T13:57:10Z W! [inputs.snmp] Collection took longer than expected; not complete after interval of 10s
août 16 15:57:20 l-adm-proxy-01r telegraf[70560]: 2021-08-16T13:57:20Z W! [inputs.snmp] Collection took longer than expected; not complete after interval of 10s
août 16 15:57:30 l-adm-proxy-01r telegraf[70560]: 2021-08-16T13:57:30Z W! [inputs.snmp] Collection took longer than expected; not complete after interval of 10s
août 16 15:57:30 l-adm-proxy-01r telegraf[70560]: 2021-08-16T13:57:30Z E! [inputs.snmp] Error in plugin: agent myipserver:3401: performing get on field uptime: request timeout (after 3 retries)

But i don’t know exactly how to fix that…i opened 3401 in udp
and tcp (for testing) but not better
so i don’t understand what i do fix here

I use Virtual Machine with red hat 8 and not docker.

Can u help me please ?
Thks :wink:

Hello @steph-ipi-69,
You may want to experiment with flush_jitter = "5s" or metric_batch_size = 5000 , change one item at a time and measure before and after to judge the impact.
How many metrics are you trying to collect and write?
@popey, do you have any suggestions here as well?

1 Like

Hello @Anaisdg

OK
thks for reply.
Do i put theses Line in telegraf.conf right ?
I test tomorow :slight_smile:

I want for the moment try to collect around 15 metrics…

Thks a lot

Hello @steph-ipi-69,
They exist in the agent portion of your config. You might have to increase them. Although it sounds like you wont have to increase the batch size if you’re only tryin to collect around 15 metrics.
15 metrics at what collection interval?

I dont remember have interval in telegraf.conf

This is telegraf.conf :

[outputs]
[outputs.influxdb]
    url = "http://localhost:8086"
    database = "telegraf"

[[inputs.snmp]]
  agents = [ "YOUR_SQUID_IP_ADDRESS:3401" ]
  version = 2
  community = "public"
  name = "snmpsquid"

 [[inputs.snmp.field]]
    name = "uptime"
    oid = "1.3.6.1.4.1.3495.1.1.3.0"

 [[inputs.snmp.field]]
    name = "cacheVersionId"
    oid = "1.3.6.1.4.1.3495.1.2.3.0"

 [[inputs.snmp.field]]
    name = "cacheMemMaxSize"
    oid = "1.3.6.1.4.1.3495.1.2.5.1.0"
 [[inputs.snmp.field]]
    name = "cacheMemUsage"
    oid = "1.3.6.1.4.1.3495.1.3.1.3.0"

 [[inputs.snmp.field]]
    name = "cacheCpuUsage"
    oid = "1.3.6.1.4.1.3495.1.3.1.5.0"

 [[inputs.snmp.field]]
    name = "cacheSwapMaxSize"
    oid = "1.3.6.1.4.1.3495.1.2.5.2.0"
 [[inputs.snmp.field]]
    name = "cacheCurrentSwapSize"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.14.0"

 [[inputs.snmp.field]]
    name = "cacheClients"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.15.0"

 [[inputs.snmp.field]]
    name = "cacheProtoClientHttpRequests"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.1.0"

 [[inputs.snmp.field]]
    name = "cacheHttpHits"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.2.0"

 [[inputs.snmp.field]]
    name = "cacheHttpErrors"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.3.0"

 [[inputs.snmp.field]]
    name = "cacheHttpInKb"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.4.0"

 [[inputs.snmp.field]]
    name = "cacheHttpOutKb"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.5.0"

 [[inputs.snmp.field]]
    name = "cacheServerInKb"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.12.0"

 [[inputs.snmp.field]]
    name = "cacheServerOutKb"
    oid = "1.3.6.1.4.1.3495.1.3.2.1.13.0"

Perhaps in grafana ?

{
“cacheTimeout”: null,
“colorBackground”: true,
“colorValue”: false,
“colors”: [
#299c46”,
“rgba(237, 129, 40, 0.89)”,
#d44a3a
],
“datasource”: “${DS_TELEGRAF}”,
“format”: “s”,
“gauge”: {
“maxValue”: 100,
“minValue”: 0,
“show”: false,
“thresholdLabels”: false,
“thresholdMarkers”: true

Hello;

i changed telegraf config file. i added my value inputs and outputs in default config file.
and i tested and same message “collection took”.
i followed your help and i put your exemple value (5s) and same message but instead of 10s message it’s for 5s :’(
i changed for put more time…

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "60s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 5000

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "5s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "5s"

And now this is the new message…

août 17 11:35:40 l-adm-proxy-01r telegraf[78167]: 2021-08-17T09:35:40Z E! [inputs.snmp] Error in plugin: agent myip:3401: performing get on field uptime: request timeout (after 3 retries)

The interval settings define also the time interval for the whole data gathering process (as you might have noticed).
Now that you set it to 60s, you are one step closer to the source issue as you get a client side timeout error.

Do you get a timeout if you try to get the data “manually” from the same machine used by telegraf?

About what can be the issue I have no idea as network errors can be caused by a huge amount of different causes.

Hello, thks for reply

I diddnt tested manually…
And i dont Know how to test.

With snmpwalk ?
Thks

Hello everyone,

Shame on me:
my squid conf was not ok.
i puted acl snm_pnet src 127.0.0.1 instead acl snmpnet src 127.0.0.1
thks Giovanni_Luisotto for your help i didn’t think to verify manually :frowning:

1 Like

@Anaisdg
Hello, I tried to read data at 1s and 1min interval. And i see that this error appears every sec or every min. In the InfluxDB, i see the data every 2sec or every 2min. when I change the interval to 2s or 30s, there is no problem. What could be the issue with 1s or 1min?

@vkhemani sorry wait, you changed what to 1s and 1min?

@Anaisdg , i have set the interval=1s in the global settings of telegraph to read data at every 1s interval from an OPC server. But, telegraph is reading data every 2s instead of 1s. I checked in the log and i see that, at every 1s, there is an error " collection took longer than expected…".

Same error at interval =1min. Telegraph is reading data at every 2min instead of 1min.

Thanks for your help

Hi @vkhemani,

The " collection took longer than expected…" error message means that the time it took for Telegraf to start collecting, send the request to the OPCUA server over the network, have the OPCUA server handle the request, respond to the request over the network, and then Telegraf process the response, was longer than the “1s” interval that you set. As a result, the next interval is skipped, hence why you see results every 2 seconds.

With an interval setting of 1s, you are expecting all the operations I listed above to happen within a second. This is entirely possible with many Telegraf plugins. However, based on some previous posts in this forum here and here and issue reports with OPC UA, it does seem that a 1s interval when interacting with OPC UA can be too small.

This is entirely dependent on the server you are connecting to, the network you are using to connect to it, and the amount of data you are querying at a given time. The first forum link above shows one possible workaround where you split up your opcua telegraf config as one option to possibly workaround this.

If you wanted someone to look at your config or Telegraf logs in more detail, I would suggest opening a new topic.

Hope that helps!

@jpowers Thanks for the reply. I went thru the links you sent. I suppose the way out is to split up the OPCUA config. I will test.

However, I have faced the same problem with 1min interval for a MODBUS config and there are very few data points. In this case, I get data at every 2 min. If I change the interval from 1min to 30 secs, it works fine and I get data every 30secs.

I also tested with 500ms interval in the case of OPC UA. In this case, I could read data in 1-1.5secs!

For some applications, it is important to read data in ms. Hence, I hope can achieve this using OPCUA+telegraf+influx.