InfluxDB on one of our monitoring VMs (MON-0-2) periodically stops accepting metrics

Hello Community,

InfluxDB on one of our monitoring VMs (MON-0-2) periodically stops accepting metrics.
There is no evidence of this in the InfluxDB log file.
The InfluxDB process is still running ("influxdb process is running [ OK ]") and querying the DB still works.

Are there any debug commands or actions that can help us find the root cause when the issue occurs again?

Thanks for your help here.

influx --version
InfluxDB shell version: 1.7.2

telegraf --version
Telegraf 1.9.1 (git: HEAD 20636091)

strace on the InfluxDB PID shows the process answering requests with HTTP 503 "Service Unavailable".

[dia4-4IF_LOC1 root@MON-0-2 influxdb]# strace -f -p 18110 -e trace=!epoll_pwait,nanosleep,futex,accept4,epoll_ctl,close,read,sched_yield,getsockname

[pid 18119] write(290, "HTTP/1.1 503 Service Unavailable"..., 354) = 354
[pid 22447] write(296, "HTTP/1.1 503 Service Unavailable"..., 373) = 373

We raised "max concurrent write limit" from 20 to 40 on our monitoring VMs to see whether that avoids the issue, but the issue is still there.
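For reference, in InfluxDB 1.x this limit lives in the [http] section of influxdb.conf. A minimal sketch of the change described above, assuming the file is at the usual /etc/influxdb/influxdb.conf (the path is an assumption, it is not stated in this thread):

```toml
# /etc/influxdb/influxdb.conf (assumed location)
[http]
  # Maximum number of writes processed concurrently; 0 disables the limit.
  # Raised from 20 to 40 as described above.
  max-concurrent-write-limit = 40
```

The same section also has max-enqueued-write-limit and enqueued-write-timeout, which control how writes wait once the concurrent limit is reached. As far as I know, writes that cannot be queued in time are rejected with a 503, so those two values are worth checking as well.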

We see the following error in the Telegraf logs:

2019-03-11T03:18:04Z E! [outputs.influxdb] when writing to [http://MON-0-2:8086]: Post http://MON-0-2:8086/write?db=telegraf: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

At the moment the issue is not occurring; it has been working for more than 24 hours now.

Hi,
Could it be a network problem?
In Telegraf, the [[outputs.influxdb]] section has a default timeout of 5 seconds.
Maybe increasing it to 10 seconds will help?

 [[outputs.influxdb]]
  ## Timeout for HTTP messages.
  # timeout = "5s"
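Filled in for this setup, the section could look roughly like the sketch below. The URL and database name are taken from the Telegraf error message earlier in this thread; 10s is only the trial value suggested here, not a recommendation.

```toml
# /etc/telegraf/telegraf.conf
[[outputs.influxdb]]
  ## InfluxDB HTTP endpoint (from the error message in this thread)
  urls = ["http://MON-0-2:8086"]
  ## Target database (from the /write?db=telegraf request in the error)
  database = "telegraf"
  ## Timeout for HTTP messages; the default is 5s
  timeout = "10s"
```

Telegraf has to be restarted to pick up the change.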

Hi MarcV,

I appreciate your quick reply.
The issue has not occurred again so far, which makes it a bit annoying to troubleshoot.

I don't see an [[outputs.influxdb]] section in our Telegraf config file.
I assume we need to add it to /etc/telegraf/telegraf.conf?
Here is our current configuration.

Telegraf Configuration

[global_tags]

[agent]
hostname = "MON-0-2"
interval = "10s"
round_interval = true
metric_buffer_limit = 10000
flush_buffer_when_full = true
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "10s"
debug = false
quiet = false

OUTPUTS:

INPUTS:

[[inputs.cpu]]
percpu = true
totalcpu = true
[[inputs.disk]]
ignore_fs = ["devtmpfs"]
[[inputs.diskio]]
[[inputs.io]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

Kr,

Johan

Hi Johan,
You can indeed add it to override the defaults.

Here you can find a complete configuration file:
Telegraf configuration

You can also generate one as explained here:
Generate configfile

You will probably also find a copy of the original config file in /etc/telegraf.

Hello Marc,

According to my colleagues it does not seem to be a network issue, because
we can reproduce the problem with curl from localhost while it is happening.
Any other suggestions for what we can do when the issue pops up?

Kr,

Johan
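
One way to capture evidence the next time it happens is to probe InfluxDB from localhost at a fixed interval and log the HTTP status codes with timestamps. The following is a minimal sketch, not a tested tool: the function names classify_status and probe_influx, the measurement name "probe", and the 5 second curl timeout are all assumptions made for illustration. It relies on InfluxDB 1.x answering /ping and successful writes with 204.

```shell
#!/bin/sh
# Hypothetical helper: map an HTTP status code to a short label.
classify_status() {
  case "$1" in
    204) echo "ok" ;;           # /ping and successful writes return 204 in InfluxDB 1.x
    503) echo "unavailable" ;;  # the status seen in the strace output
    000) echo "no_response" ;;  # curl prints 000 when it got no HTTP response (e.g. timeout)
    *)   echo "other" ;;
  esac
}

# Hypothetical helper: probe /ping and /write once and print one log line.
# Note: the write probe stores a point in a measurement named "probe" (assumed name).
probe_influx() {
  base_url="${1:-http://localhost:8086}"
  db="${2:-telegraf}"
  ping_code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$base_url/ping")
  write_code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' \
    -XPOST "$base_url/write?db=$db" --data-binary 'probe value=1')
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  echo "$ts ping=$ping_code($(classify_status "$ping_code")) write=$write_code($(classify_status "$write_code"))"
}
```

Calling probe_influx every 10 seconds in a loop and appending the output to a file shows whether only writes fail while /ping still answers. When the write probe starts returning 503, saving the output of curl http://localhost:8086/debug/vars at that moment gives a snapshot of the internal statistics to compare against a healthy period.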