Out-of-memory on backup + queries

@qootec - Johan -

Sorry for my delayed reply, I’ve been on holiday. It’s true that we aren’t satisfied with the memory management options in InfluxDB 1.x. It’s something we’re actively working on for the next major storage engine architecture. Our goals are a maximum that the system operates within, and degradation without crashing. InfluxDB v2 (release candidate) has some controls for the memory that queries can use and the number of queries that can execute at once. See this PR. The default is unlimited, so adjust the flag query-memory-bytes down to something in the 100s of GB range to start with.

On v1.8, there is also a server-side query timeout that kills queries that take too long, as in your case where the system is also taxed with running backups. As you likely know, this is different from the read timeout in the client!

There are a few other suggestions in this post. I think series-id-set-cache-size might be the most useful to you. I don’t know how many tags you have, but if you have many tags creating many series, this cache could be large. I don’t think you’ll gain much (~1 GB max) from adjusting down the TSM cache size, since it’s only about 6% of your available memory, but it could help some.
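As a rough sketch of how the two v1.8 settings mentioned above could be applied without editing influxdb.conf: InfluxDB 1.x reads environment-variable overrides for its config options. The helper name and the example values below (a 60s timeout, cache size 0) are assumptions for illustration only, not recommendations; check the 1.8 configuration docs for the defaults on your build.

```python
def influxd_env_overrides(query_timeout="60s", series_id_set_cache_size=0):
    """Return environment-variable overrides for an InfluxDB 1.8 server.

    query_timeout            -> [coordinator] query-timeout
    series_id_set_cache_size -> [data] series-id-set-cache-size (0 disables it)

    The default values here are illustrative assumptions.
    """
    if series_id_set_cache_size < 0:
        raise ValueError("series_id_set_cache_size must be >= 0")
    return {
        "INFLUXDB_COORDINATOR_QUERY_TIMEOUT": query_timeout,
        "INFLUXDB_DATA_SERIES_ID_SET_CACHE_SIZE": str(series_id_set_cache_size),
    }


# Print the overrides as shell-style assignments for a service unit or wrapper.
for name, value in influxd_env_overrides().items():
    print(f"{name}={value}")
```

The server has to be restarted for these to take effect.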

My coworker did point out that 1300 fields within a single measurement is a lot. The query SELECT last(*) is materializing all of them into memory. I believe computationally this query looks at each field individually for the newest value. If you can subset this query, it will perform better. Are you running this every 4 seconds?
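To illustrate what subsetting the query could look like, here is a minimal sketch that builds an InfluxQL statement asking for last() of only the fields you need instead of last(*). The function name, the measurement name "plant", and the field names are hypothetical.

```python
def last_values_query(measurement, fields):
    """Build an InfluxQL query that requests last() for selected fields only."""
    if not fields:
        raise ValueError("pass at least one field name")

    def quote(identifier):
        # InfluxQL identifiers are double-quoted; escape backslashes and quotes.
        return '"' + identifier.replace("\\", "\\\\").replace('"', '\\"') + '"'

    selectors = ", ".join(f"last({quote(f)}) AS {quote(f)}" for f in fields)
    return f"SELECT {selectors} FROM {quote(measurement)}"


# Hypothetical measurement and field names.
print(last_values_query("plant", ["temperature", "pressure"]))
```

Whether this helps enough depends on how many of the 1300 fields each consumer really needs; splitting the fields across several smaller queries is another option along the same lines.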

How many shards do you have? Information about your data schema and the queries you are interested in running would be helpful. It does seem to me that the problem isn’t in the queries but in the backup happening concurrently. There are options to segment the backup process by time range, shard, and/or retention policy. Reducing what you back up at once should help: Back up and restore data in InfluxDB v1.8 | InfluxDB OSS 1.8 Documentation
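A minimal sketch of segmenting by time range, assuming the portable backup format and the -start/-end flags described in the linked documentation. It only generates the influxd backup commands, one per window, so you can review them before running anything. The database name, target directory, and window size are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def backup_commands(database, start, end, step, target_dir):
    """Return one `influxd backup` argument list per time window.

    start and end are timezone-aware datetimes; timestamps are emitted
    in RFC3339 UTC form.
    """
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    commands = []
    window_start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    while window_start < end:
        window_end = min(window_start + step, end)
        commands.append([
            "influxd", "backup", "-portable",
            "-database", database,
            "-start", window_start.strftime(fmt),
            "-end", window_end.strftime(fmt),
            # One output directory per window, named after the window start.
            f"{target_dir}/{window_start.strftime('%Y%m%dT%H%M%SZ')}",
        ])
        window_start = window_end
    return commands


# Hypothetical database, range, and path: two one-day windows.
for cmd in backup_commands(
    "mydb",
    datetime(2021, 1, 1, tzinfo=timezone.utc),
    datetime(2021, 1, 3, tzinfo=timezone.utc),
    timedelta(days=1),
    "/backups",
):
    print(" ".join(cmd))
```

Running the windows one after another, ideally outside the busiest query periods, keeps the amount of data the backup touches at any one time smaller.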

Let us know what you figure out. You can try getting a pprof heap profile too when the memory use is high.
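For the heap profile, a small sketch assuming the server exposes the standard Go pprof endpoint /debug/pprof/heap on its HTTP port (8086 here) and that pprof is enabled without authentication. The host, port, and output path are assumptions to adjust for your setup.

```python
from urllib.parse import urlunsplit
from urllib.request import urlopen


def heap_profile_url(host="localhost", port=8086, scheme="http"):
    """Build the URL of the heap profile endpoint (assumed defaults)."""
    return urlunsplit((scheme, f"{host}:{port}", "/debug/pprof/heap", "", ""))


def save_heap_profile(path, url=None, timeout=30):
    """Download the heap profile and write it to `path`."""
    with urlopen(url or heap_profile_url(), timeout=timeout) as response:
        data = response.read()
    with open(path, "wb") as out:
        out.write(data)
    return path


print(heap_profile_url())
# When memory use is high, something like:
# save_heap_profile("heap.pb.gz")
```

The saved file can then be inspected with go tool pprof, or attached to a GitHub issue.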

I doubt this makes you feel better, but you are not alone here: Issues · influxdata/influxdb · GitHub

Further reading, though you likely already have a firm grasp of this: In-memory indexing and the Time-Structured Merge Tree (TSM) | InfluxDB OSS 1.8 Documentation
