InfluxDB crashes/stops showing data and we can't query

Hi Guys,

We have a setup where InfluxDB stores all our Telegraf data from 100+ servers. This Monday it started "crashing": it stops accepting requests from both OSS and Grafana.
And I have no idea how to troubleshoot this :frowning:
I checked the usual suspects: disk, CPU, memory.
Restarting the service makes it run for another couple of hours before it fails again.

So this is the setup.
Server: Windows Server 2016, 16 GB memory, 8 cores, lots of disk
InfluxDB: 2.3.0, running as a Windows service using NSSM
stdout is redirected to a log file.
InfluxDB data on disk is 26 GB.

flux-log-enabled: true
log-level: debug

stdout doesn't show any errors other than:
ts=2022-07-27T11:58:35.365862Z lvl=debug msg=Request log_id=0bwmagIG000 service=http method=POST host=XXXXXXXX:8086 path=/api/v2/write query="bucket=Telegraf&org=XXXXX" proto=HTTP/1.1 status_code=499 response_size=107 content_length=-1 referrer= remote= user_agent=Go took=10021.650ms error="internal error" error_code="internal error"

Thanks guys

Hello @Kasper_Vinding,
I’m sorry, that’s frustrating. I don’t know quite yet, I’m asking around.
Thank you for your patience.

Can you provide the full logs?

The last thing that happened today before it died was this. It is the first instance of "internal error" today.

ts=2022-08-02T07:00:23.291873Z lvl=info msg="index opened with 8 partitions" log_id=0bz8m8KW000 service=storage-engine index=tsi
ts=2022-08-02T07:00:23.291873Z lvl=info msg="Reindexing TSM data" log_id=0bz8m8KW000 service=storage-engine engine=tsm1 db_shard_id=216
ts=2022-08-02T07:00:23.291873Z lvl=info msg="Reindexing WAL data" log_id=0bz8m8KW000 service=storage-engine engine=tsm1 db_shard_id=216
ts=2022-08-02T07:00:23.301954Z lvl=info msg="Write failed" log_id=0bz8m8KW000 service=storage-engine service=write shard=216 error="engine: context canceled"
ts=2022-08-02T07:00:23.301954Z lvl=debug msg=Request log_id=0bz8m8KW000 service=http method=POST host=xxxxxxxx:8086 path=/api/v2/write query="bucket=Telegraf&org=xxxxx" proto=HTTP/1.1 status_code=499 response_size=107 content_length=-1 referrer= remote=[fe80::f5a3:187e:a09d:7d68%Ethernet0]:52009 user_agent=Telegraf took=10025.780ms error="internal error" error_code="internal error"

stdout.txt.gz (4.6 MB)

Here is the full log from today, when it crashed and we restarted it afterwards.
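
A debug log of this size is easier to scan with a script that keeps only the failed or slow write requests. The following is a minimal sketch, not an InfluxDB tool: the names `parse_line` and `failed_writes` and the 10000 ms threshold are assumptions made for illustration, and it assumes `took` is always reported in milliseconds, as in the lines quoted above. It accepts both straight and curly quotes, since forum software may have altered them.

```python
import re

# One key=value pair. A value is either wrapped in straight quotes,
# wrapped in curly quotes, or a bare run of non-space characters.
PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|“([^”]*)”|(\S*))')


def parse_line(line):
    """Return the key=value fields of one log line as a dict.

    If a key appears twice (e.g. service=... service=...), the last one wins.
    """
    fields = {}
    for key, straight, curly, bare in PAIR.findall(line):
        fields[key] = straight or curly or bare
    return fields


def failed_writes(lines, min_ms=10000.0):
    """Collect /api/v2/write requests that returned 499 or took >= min_ms.

    min_ms is a hypothetical threshold chosen to match the ~10 s requests
    in the quoted log lines.
    """
    hits = []
    for line in lines:
        fields = parse_line(line)
        if fields.get("path") != "/api/v2/write":
            continue
        took = fields.get("took", "")
        # Assumption: durations look like "10025.780ms".
        took_ms = float(took[:-2]) if took.endswith("ms") else 0.0
        if fields.get("status_code") == "499" or took_ms >= min_ms:
            hits.append(
                (fields.get("ts"), fields.get("status_code"), took_ms, fields.get("user_agent"))
            )
    return hits
```

To use it on the attached file, open it with `gzip.open("stdout.txt.gz", "rt", encoding="utf-8")` and pass the file object to `failed_writes`. The timestamps of the first hits show when writes started to stall and which client (`user_agent`) was affected.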

Hello @Kasper_Vinding,
How are you writing data?

With Telegraf and the API. @Anaisdg
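
To check from the outside whether the write endpoint is still responding, independently of Telegraf, a single timed write through the `/api/v2/write` endpoint can help. This is only a sketch under stated assumptions: `write_url` and `timed_write` are hypothetical helper names, the 15 second client timeout is arbitrary, and the base URL, org, bucket and token must be replaced with real values.

```python
import time
import urllib.error
import urllib.parse
import urllib.request


def write_url(base, org, bucket, precision="s"):
    """Build the InfluxDB 2.x write URL for the given org and bucket."""
    query = urllib.parse.urlencode({"org": org, "bucket": bucket, "precision": precision})
    return f"{base.rstrip('/')}/api/v2/write?{query}"


def timed_write(base, org, bucket, token, line, timeout=15.0):
    """POST one line-protocol point; return (HTTP status or None, seconds taken).

    A status of None means no HTTP response arrived (connection error or
    client-side timeout). The timeout value is an assumption for this probe.
    """
    request = urllib.request.Request(
        write_url(base, org, bucket),
        data=line.encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
        method="POST",
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status code.
        status = exc.code
    except OSError:
        # Covers URLError, refused connections and socket timeouts.
        status = None
    return status, time.monotonic() - start
```

For example, `timed_write("http://localhost:8086", "my-org", "Telegraf", "<token>", "probe value=1")` would normally return status 204 almost immediately on a healthy instance. A response time near 10 seconds, or a status of None, would point at the server side rather than at Telegraf.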