InfluxDB on one of our monitoring VMs (MON-0-2) periodically stops accepting metrics.
There is no evidence of this in the InfluxDB log file.
The influxd process is still running ("influxdb process is running [ OK ]") and querying the DB still works.
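For completeness, these are the kinds of liveness checks we run when the problem shows up (a minimal sketch; `MON-0-2` is our host, substitute your own). The `/ping` endpoint returns HTTP 204 when the HTTP service is accepting requests, so it distinguishes "process alive" from "HTTP layer healthy":

```shell
# Check the InfluxDB HTTP API directly; /ping returns 204 when healthy.
HOST="${HOST:-MON-0-2}"
code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://$HOST:8086/ping" || true)
if [ "$code" = "204" ]; then
  echo "influxd HTTP service OK"
else
  # 000 means no HTTP response at all (timeout / connection failure)
  echo "influxd HTTP service NOT OK (got: ${code:-none})"
fi
```

A plain query check (`influx -execute 'SHOW DATABASES'`) can still succeed while writes are being rejected, which matches what we observed.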
Are there any debug commands or actions that can help us find the root cause of the issue when it occurs again?
Thanks for your help here.
InfluxDB shell version: 1.7.2
Telegraf 1.9.1 (git: HEAD 20636091)
strace shows influxd returning "Service Unavailable" responses internally for the PID:
[dia4-4IF_LOC1 root@MON-0-2 influxdb]# strace -f -p 18110 -e trace=!epoll_pwait,nanosleep,futex,accept4,epoll_ctl,close,read,sched_yield,getsockname
[pid 18119] write(290, "HTTP/1.1 503 Service Unavailable"..., 354) = 354
[pid 22447] write(296, "HTTP/1.1 503 Service Unavailable"..., 373) = 373
We raised max-concurrent-write-limit from 20 to 40 on our monitoring VMs to see if this would avoid the issue, but it still occurs.
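For reference, this is the setting we changed in the `[http]` section of influxdb.conf (values shown are ours; the related queue settings are there as we understand the config, and when these limits are exceeded influxd answers with 503, which would match the strace output above):

```toml
[http]
  # Maximum number of concurrent write requests; 0 means unlimited.
  max-concurrent-write-limit = 40
  # Writes beyond the concurrent limit are queued up to this many...
  max-enqueued-write-limit = 0
  # ...and time out after this long in the queue.
  enqueued-write-timeout = "30s"
```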
We see the following error in the Telegraf logs:
2019-03-11T03:18:04Z E! [outputs.influxdb] when writing to [http://MON-0-2:8086]: Post http://MON-0-2:8086/write?db=telegraf: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
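That "Client.Timeout exceeded" message comes from Telegraf's HTTP client giving up before InfluxDB answers. Our output config is essentially the following (a sketch of the relevant `[[outputs.influxdb]]` options; the 5s timeout is the plugin default, and raising it only masks slow responses rather than explaining them):

```toml
[[outputs.influxdb]]
  urls = ["http://MON-0-2:8086"]
  database = "telegraf"
  # How long to wait for a response before logging the
  # "request canceled (Client.Timeout exceeded)" error.
  timeout = "5s"
```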
At the moment the issue is not occurring; it has now been working for more than 24 hours.