We’re running InfluxDB 1.7.7, and ever since migrating from 1.3.x to 1.7.x and moving to TSM and TSI, we’ve been seeing problems during TSM compactions.
Whenever a TSM compaction occurs, InfluxDB sometimes becomes momentarily unavailable, leading to a dip and then a peak in writes to the database:
As you can see, compactions run every half hour, but InfluxDB sometimes forces them to run earlier, and sometimes they don’t seem to cause the dip/peak effect at all (see 07:30 in the graph).
Our system has one very large measurement containing (building) sensor data. Each value is tagged with a sensor ID. We log data for each sensor roughly every 5-10 seconds, for a few thousand sensors.
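For reference, a single point looks roughly like this in line protocol (the measurement, tag, and field names here are made up for illustration; only the overall shape matches our schema):

```
# one point per sensor every 5-10 seconds; sensor_id is the only tag
sensor_data,sensor_id=4711 value=21.4 1563184800000000000
```

So the cardinality works out to roughly one series per sensor, i.e. a few thousand series in a single measurement.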
Could the single large measurement be the source of our problems?
Is anyone else experiencing problems like this?
It currently isn’t affecting our system that much, because we have enough capacity to rewrite the failed points and to process the backlog that accumulates due to these timeouts. But it’s not something I’m fully comfortable with: refusing writes during a (slow?) compaction is, IMHO, not a great way for InfluxDB to handle this.
Thanks! I’ve been testing some things as well. So far, by raising compact_throughput I’ve only managed to crash the server (excessive memory usage), although I was also playing around with the cache settings at the same time, so those may have contributed to the memory problems I created.
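For context, these are the knobs I’ve been touching in the [data] section of influxdb.conf, shown with what I believe are the 1.7.x defaults (the values are illustrative, not a recommendation):

```toml
[data]
  # Rate limit (bytes per second) applied to TSM level compactions.
  compact-throughput = "48m"
  # Short bursts above that limit are allowed up to this rate.
  compact-throughput-burst = "48m"
  # Maximum size the in-memory cache may reach before incoming writes are rejected.
  cache-max-memory-size = "1g"
```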
Another look at the config file made me think that the actual culprit (for most of the dips/peaks) might be the retention policy enforcer. The dip/peak cycles occur exactly every 30 minutes, and the retention policy enforcer is set to run every 30 minutes. This can’t explain all of the dips and peaks - not all of them occur exactly on the 30-minute mark - but it does explain most of them.
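This is the [retention] section in question; as far as I know, check-interval = "30m" is also the default, and it matches our 30-minute dip/peak cycle exactly:

```toml
[retention]
  # Whether expired data is deleted at all.
  enabled = true
  # How often the retention policy enforcement check runs.
  check-interval = "30m"
```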
Whenever the retention policy enforcer runs, I see log lines like these:
It doesn’t seem to run excessively long or do anything weird, but the times logged by InfluxDB line up exactly with the dips/peaks I see in my write throughput.
While I was typing this, InfluxDB seems to have gone through a few dip/peak cycles that weren’t exactly on the 30-minute mark, but I do see the following log entry at the exact time of one of the dips/peaks:
So maybe it’s the retention policy enforcer combined with the TSM cache snapshotter? Note that not every off-the-half-hour dip/peak has a cache log line associated with it, and cache log lines also appear without any dip/peak in my write throughput.
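If the cache snapshotter is involved, these are the [data] settings that control when the in-memory cache gets snapshotted to a new TSM file (again shown with what I believe are the 1.7.x defaults, just for reference):

```toml
[data]
  # Snapshot the cache to disk once it grows past this size...
  cache-snapshot-memory-size = "25m"
  # ...or once a shard has received no writes for this long.
  cache-snapshot-write-cold-duration = "10m"
```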
What’s weird to me is that, if that’s true, why would InfluxDB stop accepting writes just because it’s doing some I/O to disk?