Memory leak in InfluxDB 1.7.4?

OS: Debian 9.8 (stretch)
Version: InfluxDB 1.7.4

Since I updated from 1.7.3 to 1.7.4, I’ve had runaway memory consumption from the influxdb process: it grows steadily until it has consumed all available memory, and only a restart brings it back down. This seems to be new behavior, but I don’t see any other threads discussing it. See the annotated screenshot from my system monitoring below. (Restarts free up all of the locked memory – the chart doesn’t show this because of datapoint decimation.)

Data source is Icinga 2 performance data.

Any similar observations, or pointers on options to debug?

Anyone have thoughts here? I’ve taken to restarting influxdb hourly from cron, which is hardly a fix.
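
For reference, the workaround is nothing fancier than a root cron entry along these lines (the file path and the influxdb unit name are assumptions based on a stock Debian package):

# /etc/cron.d/influxdb-restart (temporary workaround, not a fix)
# restart InfluxDB at the top of every hour to release the leaked memory
0 * * * * root /bin/systemctl restart influxdb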

I’d downgrade to 1.7.3, but the release notes indicate that version has a high likelihood of losing data. I’m not sure whether it’s safe, or even possible, to roll back to 1.7.2.

Sorry, no idea; as you said, so far there are no other threads discussing this.

Thanks Marc. Someone else with a similar use case commented and then deleted their post; I don’t know if they resolved their problem. I wonder if an Icinga2 update (concurrent with the InfluxDB update) is now submitting data in a way that is problematic, but I can’t come up with a likely scenario.

What index type are you using? Depending on the data structure, it could be using too much memory.

If you are using inmem for your index type, you should switch to TSI and check the results.
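
Roughly, the conversion looks like this; a sketch assuming the stock Debian paths and the influx_inspect tool that ships with 1.7, so adjust to your setup:

# 1) In /etc/influxdb/influxdb.conf, enable TSI for new shards:
#      [data]
#        index-version = "tsi1"
# 2) Stop the service, convert existing shards, fix ownership, restart:
systemctl stop influxdb
influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal
chown -R influxdb:influxdb /var/lib/influxdb
systemctl start influxdb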

I’ve been serially updating since 1.0.2 (or possibly earlier), so TSI did not yet exist when this database was created. I’ll switch and see if that helps, but I think this is still indicative of a 1.7.4 bug because my dataset is time-limited. Thanks for the suggestion, though – I’m sure it will help!

This did not resolve my issue, unfortunately.

It appears that both influxd and icinga2 are growing in size over time, so it’s possible this is an interaction between versions of both applications. I’ll continue to try to characterize it.

Update: Switching to TSI has made the memory leak rate worse. I now have to restart influx every 6 hours to avoid taking down the machine. Any sort of trace guidance would be helpful…
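
In the meantime I may try profiling it myself: influxd is a Go process and the 1.x HTTP API exposes the standard pprof endpoints (unless pprof-enabled has been turned off), so a heap profile should show where the memory sits. A rough sketch, assuming the default port 8086 and a local Go toolchain:

# snapshot the heap of the running influxd
curl -s -o heap.pprof http://localhost:8086/debug/pprof/heap
# show the largest in-use allocations from that snapshot
go tool pprof -top heap.pprof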

Can you tell us some metrics? Number of series, number of measurements, number of databases, cardinality, size of data stored, etc.

> show databases
name: databases
name
----
_internal
icinga2
vdo

> show retention policies on _internal
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
monitor 168h0m0s 24h0m0s            1        true

> show series cardinality on _internal
cardinality estimation
----------------------
1494

> show retention policies on icinga2
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 672h0m0s 24h0m0s            1        true

> show series cardinality on icinga2
cardinality estimation
----------------------
745

> show retention policies on vdo
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true

> show series cardinality on vdo
cardinality estimation
----------------------
12

I’m not clear on how to get the number of measurements from InfluxDB: it doesn’t look to be possible from the query language? (But see the note after the listing below.) Here’s the on-disk size:

# cd /var/lib/influxdb/data && du -sk *
264996	icinga2
72712	_internal
239512	vdo

So, overall, nothing particularly large.
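
(On the measurement count mentioned above: InfluxQL does appear to have statements for it; untested on my side, but something like the following should work on 1.7:)

> show measurements on icinga2
> show measurement cardinality on icinga2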

Are there other things running on this VM?

Also, what do the logs say? What is killing Influx? OOM Killer?
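
If you’re not sure, the kernel log will say; something along these lines (assuming journald on Debian 9):

# look for OOM-killer activity in the kernel log
dmesg -T | grep -i 'out of memory'
journalctl -k --since '24 hours ago' | grep -iE 'oom|killed process'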

Nothing else is using an unusual amount of memory. The largest is Icinga2; beyond that there are small things to support monitoring and visualization: mysql (for Icinga’s config), apache, saslauthd, postfix, grafana.

The culprit is clearly influxd – it grows slowly from about 6% of memory to 60-70%. I haven’t let the OOM-killer get it yet because I get alerting on low memory, and I’ve added a cron job that just restarts influxd every 6 hours (which has solved the problem for very low values of “solved”).

Neither the influx nor the icinga logs say much of interest. Influx is doing detailed httpd logging right now; I’ll turn that off.
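
(For anyone else wanting to do the same, I believe the relevant knob in /etc/influxdb/influxdb.conf is the request-logging flag in the [http] section:)

[http]
  # disable per-request HTTP access logging
  log-enabled = false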

A few years ago I ran into a problem where a version of Icinga2 wouldn’t reconnect after losing an SSL connection to influxd; it would queue up data and eventually blow up. That’s not the case here: influxd is the big process, not icinga2.

(Sigh.) This may yet be Icinga2-related. They just released 2.10.4 today with a changelog entry of “Fix TLS connections in Influxdb/Elasticsearch features leaking file descriptors (#6989 #7018 ref/IP/12219)”. I’ll report back if this resolves the problem.