Influx crashes/stops showing data and we can't query

Kasper_Vinding · July 27, 2022, 8:59pm

Hi Guys,

We have a setup where InfluxDB receives all our Telegraf data from 100+ servers. This Monday it started "crashing": it stops accepting requests from both OSS and Grafana.
I have no idea how to troubleshoot this.
I checked the usual suspects: disk, CPU, memory.
Restarting the service keeps it running for another couple of hours before it fails again.

This is the setup:
Server: Windows 2016, 16 GB memory, 8 cores, lots of disk
InfluxDB: 2.3.0, running as a Windows service via NSSM
stdout is redirected to a log file.
InfluxDB on disk is 26 GB.

config.yaml:
flux-log-enabled: true
log-level: debug

stdout doesn't show any errors other than:
ts=2022-07-27T11:58:35.365862Z lvl=debug msg=Request log_id=0bwmagIG000 service=http method=POST host=XXXXXXXX:8086 path=/api/v2/write query="bucket=Telegraf&org=XXXXX" proto=HTTP/1.1 status_code=499 response_size=107 content_length=-1 referrer= remote=10.1.222.207:57120 user_agent=Go took=10021.650ms error="internal error" error_code="internal error"

Thanks guys

Anaisdg · August 1, 2022, 5:53pm

Hello @Kasper_Vinding,
I’m sorry, that’s frustrating. I don’t know quite yet, I’m asking around.
Thank you for your patience.

Can you provide the full logs?

Kasper_Vinding · August 2, 2022, 7:26am

This was the last thing logged today before it died. It is the first instance of "internal error" today.

ts=2022-08-02T07:00:23.291873Z lvl=info msg="index opened with 8 partitions" log_id=0bz8m8KW000 service=storage-engine index=tsi
ts=2022-08-02T07:00:23.291873Z lvl=info msg="Reindexing TSM data" log_id=0bz8m8KW000 service=storage-engine engine=tsm1 db_shard_id=216
ts=2022-08-02T07:00:23.291873Z lvl=info msg="Reindexing WAL data" log_id=0bz8m8KW000 service=storage-engine engine=tsm1 db_shard_id=216
ts=2022-08-02T07:00:23.301954Z lvl=info msg="Write failed" log_id=0bz8m8KW000 service=storage-engine service=write shard=216 error="engine: context canceled"
ts=2022-08-02T07:00:23.301954Z lvl=debug msg=Request log_id=0bz8m8KW000 service=http method=POST host=xxxxxxxx:8086 path=/api/v2/write query="bucket=Telegraf&org=xxxxx" proto=HTTP/1.1 status_code=499 response_size=107 content_length=-1 referrer= remote=[fe80::f5a3:187e:a09d:7d68%Ethernet0]:52009 user_agent=Telegraf took=10025.780ms error="internal error" error_code="internal error"
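
For anyone who wants to sift a log like this for the first failing or slow requests, here is a rough Python sketch. It is not an InfluxDB tool, just a generic parser for the key=value lines shown above. The function names, the 5000 ms threshold, and the rule "status 400 or higher counts as failed" are my own assumptions for illustration. It accepts both straight quotes and the typographic quotes that forum software sometimes substitutes.

```python
import re

# key=value pairs; the value is straight-quoted, curly-quoted, or a bare token.
PAIR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|“([^”]*)”|(\S*))')


def parse_line(line):
    """Return the key=value fields of one log line as a dict.

    If a key appears twice on a line, the last value wins.
    """
    fields = {}
    for key, straight, curly, bare in PAIR_RE.findall(line):
        fields[key] = straight or curly or bare
    return fields


def took_ms(fields):
    """Return the 'took' field in milliseconds, or None if absent or malformed."""
    raw = fields.get("took", "")
    if not raw.endswith("ms"):
        return None
    try:
        return float(raw[:-2])
    except ValueError:
        return None


def suspicious_requests(lines, min_ms=5000.0):
    """Yield parsed HTTP request lines that failed or were slow.

    Assumption: 'failed' means status_code >= 400 and 'slow' means
    took >= min_ms. Both thresholds are illustrative, not InfluxDB defaults.
    """
    for line in lines:
        fields = parse_line(line)
        if fields.get("msg") != "Request":
            continue
        status = fields.get("status_code", "")
        failed = status.isdigit() and int(status) >= 400
        duration = took_ms(fields)
        slow = duration is not None and duration >= min_ms
        if failed or slow:
            yield fields


def scan_file(path, min_ms=5000.0):
    """Print timestamp, status, duration and path for each suspicious request."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for fields in suspicious_requests(handle, min_ms):
            print(fields.get("ts"), fields.get("status_code"),
                  fields.get("took"), fields.get("path"))
```

Calling `scan_file("stdout.txt")` on the unpacked log (the file name is only an example) lists the matching requests in order, so the first entry shows when the trouble started and whether writes or queries were hit first.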

Kasper_Vinding · August 2, 2022, 7:35am

stdout.txt.gz (4.6 MB)

Here is the full log from today, covering the crash and the restart we did afterwards.

Anaisdg · August 2, 2022, 6:59pm

Hello @Kasper_Vinding,
How are you writing data?

Kasper_Vinding · August 2, 2022, 8:11pm

With Telegraf and the API. @Anaisdg
