Influx on one of our monitoring VMs (MON-0-2) periodically stops accepting metrics

influxdb

#1

Hello Community,

Influx on one of our monitoring VMs (MON-0-2) periodically stops accepting metrics.
There is no evidence of this in the Influx logfile.
InfluxDB is still running (“influxdb process is running [ OK ]”) and querying the DB still works.

Are there any debug commands or actions that can help us find the root cause when the issue occurs again?

Thanks for your help here.

influx --version
InfluxDB shell version: 1.7.2

telegraf --version
Telegraf 1.9.1 (git: HEAD 20636091)

strace indicates ‘Service Unavailable’ responses being written by the Influx PID:

[dia4-4IF_LOC1 root@MON-0-2 influxdb]# strace -f -p 18110 -e trace=!epoll_pwait,nanosleep,futex,accept4,epoll_ctl,close,read,sched_yield,getsockname

[pid 18119] write(290, “HTTP/1.1 503 Service Unavailable”…, 354) = 354
[pid 22447] write(296, “HTTP/1.1 503 Service Unavailable”…, 373) = 373

We raised the ‘max-concurrent-write-limit’ from 20 to 40 on our monitor VMs to see whether this would avoid the issue, but it still occurs.
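For context, that limit lives in the [http] section of /etc/influxdb/influxdb.conf. A sketch of the change, with the related queueing settings shown commented out (everything except the 40 is an assumption; check your own file):

```toml
[http]
  ## /write requests beyond this many concurrent ones are rejected,
  ## producing the 503 Service Unavailable seen in the strace output.
  max-concurrent-write-limit = 40
  ## Optionally queue writes over the limit instead of rejecting them outright.
  # max-enqueued-write-limit = 0
  ## How long a queued write may wait before being rejected.
  # enqueued-write-timeout = 0
```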

We see the following error in the Telegraf logs:

2019-03-11T03:18:04Z E! [outputs.influxdb] when writing to [http://MON-0-2:8086]: Post http://MON-0-2:8086/write?db=telegraf: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

At this moment the issue is not occurring; it has now been running cleanly for more than 24 hours.
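For the next time the symptom appears, a few checks worth capturing (hostname and port taken from the Telegraf log line above; this is a sketch, adjust to your setup):

```shell
# Liveness of the HTTP endpoint: a healthy InfluxDB answers /ping with 204.
curl -s -o /dev/null -w "%{http_code}\n" http://MON-0-2:8086/ping

# httpd module counters (active vs. total write requests, error counts),
# to compare against the concurrent-write limit at the moment of failure.
influx -execute "SHOW STATS FOR 'httpd'"

# Go runtime and per-module counters exposed in expvar format.
curl -s http://MON-0-2:8086/debug/vars | head -n 20
```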


#2

Hi,
Maybe a network problem? In Telegraf the [[outputs.influxdb]] section contains a default 5-second timeout;
maybe increasing it to 10 seconds can help?

 [[outputs.influxdb]]
  ## Timeout for HTTP messages (default "5s"); raise it like this:
  timeout = "10s"

#3

Hi MarcV,

I appreciate your quick reply.
At this moment the issue still has not reoccurred, which makes it a bit hard to debug.

I don’t see the [[outputs.influxdb]] section in the telegraf config file.
I assume we need to add this in /etc/telegraf/telegraf.conf ?
Here is our current configuration.

Telegraf Configuration

[global_tags]

[agent]
hostname = "MON-0-2"
interval = "10s"
round_interval = true
metric_buffer_limit = 10000
flush_buffer_when_full = true
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "10s"
debug = false
quiet = false

OUTPUTS:

INPUTS:

[[inputs.cpu]]
percpu = true
totalcpu = true
[[inputs.disk]]
ignore_fs = ["devtmpfs"]
[[inputs.diskio]]
[[inputs.io]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

Kr,

Johan


#4

Hi Johan,
You can indeed add it so that you can override the defaults.

here you can find a complete configuration file
Telegraf configuration

You can also generate one as explained here
Generate configfile

In /etc/telegraf you will probably find a copy of the original config file.
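A minimal sketch of the section to add (the URL and database name are assumed from the Telegraf error in post #1; the timeout is raised per the earlier suggestion — verify all values against your setup):

```toml
[[outputs.influxdb]]
  ## Assumed from the Telegraf error log above; adjust as needed.
  urls = ["http://MON-0-2:8086"]
  database = "telegraf"
  ## Default is 5s; raised as suggested earlier in the thread.
  timeout = "10s"
```

The full annotated reference file can also be regenerated with `telegraf config`.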


#5

Hello Marc,

According to my colleagues it does not seem to be a network issue, because we can reproduce the problem with curl from localhost while it is happening.
Any other suggestions for what we can do when the issue pops up?
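When it does pop up, the localhost check can be made a bit more diagnostic than a plain curl (a sketch; the database name is assumed to be telegraf, as in the log in post #1):

```shell
# /ping bypasses the write path: a 204 here combined with a 503 on /write
# points at the write limiter rather than the network or the process itself.
curl -s -o /dev/null -w "ping:  %{http_code}\n" http://localhost:8086/ping

# A minimal test write in line protocol; watch for 503 vs. 204.
curl -s -o /dev/null -w "write: %{http_code}\n" \
  -XPOST "http://localhost:8086/write?db=telegraf" \
  --data-binary 'debug_probe,host=MON-0-2 value=1'
```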

Kr,

Johan