Hi, we have many buckets with retention set to 60 days, so the default shard group duration is 1d.
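For context, this is roughly how we confirm that per bucket. It is only a minimal sketch against the v2 HTTP API: the URL and token are placeholders, and the `retentionRules` field names are taken from the documented bucket schema, so adjust them if your version differs.

```python
# Minimal sketch: list each bucket's retention and shard group duration via the
# v2 HTTP API. URL and token are placeholders; field names may need adjusting.
import json
import urllib.request

INFLUX_URL = "http://localhost:8086"   # assumption: adjust to your instance
TOKEN = "YOUR_API_TOKEN"               # assumption: a token that can read buckets

req = urllib.request.Request(
    f"{INFLUX_URL}/api/v2/buckets?limit=100",
    headers={"Authorization": f"Token {TOKEN}"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    payload = json.load(resp)

for bucket in payload.get("buckets", []):
    for rule in bucket.get("retentionRules", []):
        retention_days = rule.get("everySeconds", 0) / 86400
        # shardGroupDurationSeconds may be 0/absent when the server default applies
        shard_hours = rule.get("shardGroupDurationSeconds", 0) / 3600
        print(f"{bucket['name']}: retention {retention_days:.0f}d, "
              f"shard group duration {shard_hours:.0f}h")
```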
In some cases the newly created shard file (corrupted?) cannot be written to by the metrics provider, and only an InfluxDB restart solves the issue. This always happens at the 00:00 switch from the old shard to the new one. Besides that, we noticed that memory usage increases during this period.
Could you please advise what the reason could be, or how we can eliminate and/or monitor this situation?
InfluxDB OSS 2.7 running on OpenShift with NFS storage.
Thanks
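One thing we could do in the meantime is poll the Prometheus-style `/metrics` endpoint around midnight and watch the write counter and resident memory together. This is a rough sketch under assumptions: the endpoint path and the metric names (beyond `storage_writer_ok_points`, which is quoted from the issue below) need to be confirmed against your own `/metrics` output.

```python
# Minimal monitoring sketch: poll InfluxDB's Prometheus-style /metrics endpoint
# around the 00:00 shard rollover and report the write-counter delta and
# resident memory. The metric names below are assumptions; confirm them against
# your own /metrics output (if the write metric is exposed as a histogram,
# watch its _sum series instead).
import time
import urllib.request

INFLUX_METRICS_URL = "http://localhost:8086/metrics"  # assumption: adjust host/port
WRITE_METRIC = "storage_writer_ok_points"              # quoted from the issue below
MEMORY_METRIC = "process_resident_memory_bytes"        # standard Go process metric
POLL_SECONDS = 60

def scrape(metric_name: str) -> float:
    """Sum all samples of a metric from the plain-text exposition format."""
    total = 0.0
    with urllib.request.urlopen(INFLUX_METRICS_URL, timeout=10) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith(metric_name + " ") or line.startswith(metric_name + "{"):
                total += float(line.rsplit(" ", 1)[-1])
    return total

previous = scrape(WRITE_METRIC)
while True:
    time.sleep(POLL_SECONDS)
    current = scrape(WRITE_METRIC)
    rate = (current - previous) / POLL_SECONDS
    rss_gib = scrape(MEMORY_METRIC) / 2**30
    print(f"write rate ~{rate:.0f} points/s, resident memory ~{rss_gib:.2f} GiB")
    # A write rate that drops to ~0 right after midnight while resident memory
    # keeps growing matches the behaviour described here and is worth alerting on.
    previous = current
```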
Same issue as the one described here (just found it now):
(GitHub issue opened 10:45 AM, 29 Nov 2022 UTC)
__Steps to reproduce:__
List the minimal actions needed to reproduce the behavior.
1. Run influxdb2
2. Insert metrics (with telegraf)
3. Wait for some time
__Expected behavior:__
Things keep working
__Actual behavior:__
InfluxDB2 main index ("telegraf") stops reading/writing data.
Other indexes work fine, including one that is 5m aggregates of the telegraf raw index (obviously this one does not get any new data).
We have had this in the past randomly, but in the last few weeks it has happened every few days.
In the past it seemed to happen at 00:00 UTC when influx did some internal DB maintenance, but now it happens at random times.
__Environment info:__
* System info: Linux 3.10.0-1160.66.1.el7.x86_64 x86_64
* InfluxDB version: InfluxDB v2.3.0+SNAPSHOT.090f681737 (git: 090f681737) build_date: 2022-06-16T19:33:50Z
* Other relevant environment details: CentOS 7 on vmware - lots of spare IO, CPU, memory.
Our database is 170GB, mostly metrics inserted every 60s, some every 600s.
storage_writer_ok_points is around 2.5k/s for 7mins, then ~25k/s for 3mins for the every-600s burst.
VM has 32G RAM, 28G of which is in buffers/cache.
4 cores, and typically sits at around 90% idle.
~ 24IOPS, 8MiB/s
__Config:__
```
bolt-path = "/var/lib/influxdb/influxd.bolt"
engine-path = "/var/lib/influxdb/engine"
flux-log-enabled = "true"
```
We have enabled flux-log to see if specific queries are causing this, but that doesn't seem to be the case.
__Logs:__
Include snippet of errors in log.
__Performance:__
I captured a 10s pprof which I will attach.
I also have a core dump and a 60s dump of debug/pprof/trace (not sure whether the trace has sensitive info, but I can share it privately - the core dump certainly will).
Hello @Suhanbongo,
Thanks for posting the issue. If there's already an existing issue for this, I'd refer to that one for help. Thank you