I’ve had some of my InfluxDB systems repeatedly experience OOMs and shard corruption, apparently resulting from increases in data ingest that prevent timely compaction, from disk outages (perhaps themselves caused by failing compactions), or from both.
Are there general recommendations on how to avoid this sort of problem? I assume one answer is “don’t put too much stuff into your DB too fast”. Is that, if you will, a cardinal sin? And is the way to deal with it simply to control your data sources?
Are there other recommendations to help build resilience in the face of difficult-to-control data sources?
Any docs on best practices generally or specifically regarding this issue?
Thanks for your thoughts on this.
Hello @Raymond_Keller,
Generally, yes, I’d say you’re right. It’s important to remember that InfluxDB will use whatever memory is available to it in order to optimise reads and writes. That said, I can think of the following recommendations (rough sketches for each follow the list):
- make sure TSI is enabled, so the series index lives on disk instead of entirely in memory
- monitor your InfluxDB instance with another instance, so you still have visibility when the primary one is struggling
- take a look at metric_buffer_limit if you’re using Telegraf, so metrics are buffered through short outages rather than dropped
- set retention policies that match how long you actually need the data, so old shards are expired automatically
- reduce cardinality where possible, and cap it so a hard-to-control data source can’t grow the index unbounded
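
For TSI, here’s a minimal sketch of the relevant influxdb.conf setting, assuming InfluxDB 1.x OSS:

```toml
[data]
  # Use the disk-based TSI index instead of the default in-memory
  # ("inmem") index, so the series index no longer has to fit in RAM.
  index-version = "tsi1"
```

Note that existing shards keep their old index until rebuilt; with the server stopped, you can convert them with `influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal` (adjust the paths to your install).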
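For the “monitor it with another instance” point, one common pattern is a Telegraf agent that scrapes the primary’s /debug/vars endpoint and writes to a separate monitoring instance. A sketch, with the hostnames and database name as placeholders:

```toml
# Scrape internal runtime stats (heap size, write stats, etc.)
# from the primary instance's /debug/vars endpoint.
[[inputs.influxdb]]
  urls = ["http://primary-influxdb:8086/debug/vars"]

# Ship those stats to a *separate* InfluxDB instance, so the
# metrics survive when the primary is OOMing or down.
[[outputs.influxdb]]
  urls = ["http://monitor-influxdb:8086"]
  database = "influxdb_monitoring"
```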
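On metric_buffer_limit: each Telegraf output buffers up to that many metrics in memory while its destination is unreachable, so a brief InfluxDB outage means delayed writes rather than lost ones. A sketch of the [agent] section, with illustrative numbers:

```toml
[agent]
  interval = "10s"
  flush_interval = "10s"
  # Metrics are sent to outputs in batches of this size.
  metric_batch_size = 1000
  # Per-output in-memory buffer used when the destination is
  # unreachable; the oldest metrics are dropped once it's full.
  metric_buffer_limit = 10000
```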
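For retention policies, a minimal InfluxQL sketch (“mydb” and the duration are placeholders):

```sql
-- Keep raw data for 30 days; shards older than that are dropped
-- automatically, which also bounds disk usage and compaction work.
CREATE RETENTION POLICY "thirty_days" ON "mydb" DURATION 30d REPLICATION 1 DEFAULT

-- Verify what is currently configured.
SHOW RETENTION POLICIES ON "mydb"
```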
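And for cardinality, start by measuring it, then consider the [data] limits, which reject writes that would create new series past a threshold instead of letting the index grow until the process OOMs. The database name, tag key, and limit values below are placeholders:

```sql
-- Total distinct series in the database.
SHOW SERIES CARDINALITY ON "mydb"

-- Drill into a suspect tag to see how many distinct values it has.
SHOW TAG VALUES CARDINALITY ON "mydb" WITH KEY = "host"
```

```toml
[data]
  # Refuse writes that would push past these limits, rather than
  # letting an unruly source grow the index without bound.
  max-series-per-database = 1000000
  max-values-per-tag = 100000
```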