InfluxDB 1.8: occasionally really long cache snapshots

We’re using InfluxDB 1.8 and, in general, it runs great. However, since our write load increased, we’ve been seeing one minor issue.

A few times an hour, we now get log messages about write timeouts, even during periods with no continuous queries (CQs) and no queries at all. The only pattern we’ve noticed is that these timeouts occur close in time to a cache snapshot that takes over 10 seconds.

The vast majority of snapshots take less than a second, but occasionally one takes 10+ seconds.
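For anyone who wants to check the same correlation in their own logs, here is a rough sketch of how it could be done. It is not the exact script we used, and it makes several assumptions: that log lines are logfmt-style with a `ts=` timestamp field, that the end of a snapshot is logged with the text `Cache snapshot (end)` and an `op_elapsed=` duration in `ms` or `s`, and that timeout lines contain the word `timeout`. The function names, the `marker` parameter, and the window size are all made up for illustration, so adjust the patterns to whatever your log lines actually look like.

```python
import re
from datetime import datetime, timedelta

# Assumed logfmt-style fields; change these patterns to match your real logs.
TS_RE = re.compile(r"ts=(\S+)")
ELAPSED_RE = re.compile(r"op_elapsed=([\d.]+)(ms|s)\b")


def parse_ts(line):
    """Return the line's timestamp, or None if it has no parsable ts= field."""
    m = TS_RE.search(line)
    if not m:
        return None
    try:
        return datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None


def slow_snapshots(lines, threshold_s=10.0):
    """Return (end_time, duration_seconds) for snapshots at or above threshold_s."""
    found = []
    for line in lines:
        if "Cache snapshot (end)" not in line:  # assumed message text
            continue
        ts = parse_ts(line)
        m = ELAPSED_RE.search(line)
        if ts is None or m is None:
            continue
        secs = float(m.group(1)) / (1000.0 if m.group(2) == "ms" else 1.0)
        if secs >= threshold_s:
            found.append((ts, secs))
    return found


def timeouts_near(lines, snapshots, window_s=30.0, marker="timeout"):
    """Return (timeout_time, snapshot_end_time) pairs where a line containing
    `marker` falls within window_s of a slow snapshot's start-to-end interval."""
    window = timedelta(seconds=window_s)
    hits = []
    for line in lines:
        if marker not in line.lower():
            continue
        ts = parse_ts(line)
        if ts is None:
            continue
        for end, secs in snapshots:
            start = end - timedelta(seconds=secs)
            if start - window <= ts <= end + window:
                hits.append((ts, end))
                break
    return hits
```

Calling `slow_snapshots(open("influxd.log"))` and then passing the result to `timeouts_near` with the same lines would list the timeouts that land near a slow snapshot (the file name here is just a placeholder).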

We’re using the default configuration for almost everything; the only changes are turning off the tag limits and disabling internal monitoring.

We have no error logs about running out of cache memory, or really any errors at all. The log is filled with nothing but writes, snapshots and TSM compactions.

We’re running InfluxDB on an AWS Fargate instance with an EFS disk that has plenty of additional capacity. We’re able to drive InfluxDB much harder when needed, and we can’t seem to find any correlation that explains why the snapshots occasionally take so long.

Any ideas?