I was going through InfluxDB's memory management and fail to understand how memory can grow unbounded. I am sure I am missing something, hence this request for clarification.
Excerpts from InfluxDB documentation below:
#1 - The WAL is where incoming writes hit initially. It is a write-ahead log ("WAL") file structure that allows writes to be appended to the file. The WAL is organized as a bunch of files that look like _000001.wal. The file numbers are monotonically increasing and are referred to as WAL segments. When a segment reaches 10MB in size, it is closed and a new one is opened. Each WAL segment stores multiple compressed blocks of writes and deletes.
#2 - The Cache is an in-memory copy of all data points currently stored in the WAL. The points are organized by the key, which is the measurement. When a snapshot compaction occurs, the values in the cache are written to a new TSM file and the associated WAL segments are removed. There is a lower bound, [cache-snapshot-memory-size], which when exceeded will trigger a snapshot to TSM files and remove the corresponding WAL segments.
cache-snapshot-memory-size = "25m" - the size at which the engine will snapshot the cache and write it to a TSM file, freeing up memory.
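For reference, these settings sit in the [data] section of influxdb.conf; a sketch with the documented 1.8 defaults (values taken from the docs, adjust for your own setup) looks like:

```
[data]
  # Hard limit: writes are rejected if the cache grows past this size.
  cache-max-memory-size = "1g"
  # Snapshot the cache to a new TSM file once it reaches this size.
  cache-snapshot-memory-size = "25m"
  # Also snapshot if the shard has received no writes for this long.
  cache-snapshot-write-cold-duration = "10m"
```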
If the highlighted statements in #2 hold, how does memory keep growing?
It would be helpful if anyone could clarify this.
I'm happy to explain beyond this brief answer - please ask follow-up questions, or mark this answer as the solution if you're all set.
You are 100% correct that the cache is memory-limited (within a margin), but there are other memory users. A dominant one can be serving queries. I say "can" because how much memory is used to build a query response depends entirely on your query and your data shape. This kind of memory usage is partially heap-based and transient, unless you run queries on a schedule, in which case it becomes regular. Query execution also makes significant use of memory-mapped files: reading a TSM file is done by mapping segments of it into memory, and this usage can also be significant.

So on the write path, memory usage is generally limited to the cache plus some supporting heap, while the read (query) path has the potential to use a significant amount of memory. If you don't query your data, you won't see this memory usage, but presumably you are storing data in order to query it later.
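If you want to see the heap portion in rough numbers, one general option (assuming the HTTP API is on the default localhost:8086) is the /debug/vars endpoint, which exposes the Go runtime's memory statistics:

```
# Rough sketch: watch Go heap usage around query execution.
# Assumes the HTTP API on localhost:8086 and python3 for pretty-printing.
curl -s http://localhost:8086/debug/vars | python3 -m json.tool | grep -E 'HeapAlloc|HeapInuse'
```

Running it before and after a heavy query gives a feel for how much heap the response builds up.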
If you want to share your usage in general terms, I am interested and it might help me tailor my responses here.
Thanks @philjb. It would help if you could clarify my follow-up questions.
So if we have multiple databases, and only a couple of them have scheduled queries, then I presume that query memory would be limited to just those couple of databases.
Also, another scenario: say I am writing 500 points/min to a database and I have a scheduled query every 5 minutes (e.g. t0, t5, t10, t15, etc.). When does InfluxDB decide to flush out the memory related to t0 and t5? Is this tunable?
Just one follow-up question: irrespective of how many points are written to InfluxDB, is the maximum memory bounded by (num-databases * cache-snapshot-memory-size) + some supporting heap? Is my understanding correct?
In our environment, we have 10 databases, of which only 4 key databases receive metrics on a regular basis. Among those 4, only 1 has a scheduled query; the others are hit by occasional ad-hoc user queries, which are rare enough to ignore in this discussion. I see my RAM usage grow by more than 1 GB over 2 days.
Also, one key point: in the databases that are only written to (never queried), the field values are quite large, over 10 KB each, though well below the 64 KB limit.
A disclaimer: since we're asking questions in general terms (about 1.8.x), my answers are also general and should be taken qualitatively.
InfluxDB is a Go program, and the runtime schedules when GCs happen and memory objects are cleaned up. I expect "most" of the memory used for an action (query, write, task, etc.) to be freeable when the action completes, but some may be retained in memory pools, waiting on the GC, and so on. If you're watching a memory usage chart, it can be hard to correlate a rise in memory with a specific action, and likewise a decrease with an action's completion. You'd need to take memory profiles to get a true picture of where memory is being used at a given instant.
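For example, a heap profile can be pulled from a running 1.x instance like this (a sketch assuming the default localhost:8086, pprof enabled, and the Go toolchain available for analysis):

```
# Grab a heap profile and show the top memory consumers.
curl -s -o heap.pprof http://localhost:8086/debug/pprof/heap
go tool pprof -top heap.pprof
```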
You asked about database and memory isolation. If you’re running multiple instances of InfluxDB, memory usage is contained within each instance. Within an instance, I believe that the WAL and cache are shared across buckets/databases, but there is a TSM file on disk for each bucket/retention policy.
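To make the layout concrete, my understanding of the 1.x on-disk structure (illustrative paths only, assuming the default directories and a hypothetical database mydb with the autogen retention policy) is roughly:

```
# TSM files are grouped per database / retention policy / shard id,
# while WAL segments sit under the separate wal directory with the same grouping.
ls /var/lib/influxdb/data/mydb/autogen/2/   # e.g. 000000001-000000001.tsm
ls /var/lib/influxdb/wal/mydb/autogen/2/    # e.g. _000001.wal
```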
Technically, memory usage is unbounded, as there's no mechanism to limit it in total. In practice you should see an empirical maximum given a constant data volume and query complexity. If you're adding more and more data, those TSM files will be memory-mapped to service queries, so more and more data will enter memory. Mmap'ing is handled by the OS; if there is no other demand on the system, the OS may keep the mapped pages resident until memory pressure increases.
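On Linux you can see the difference between what is mapped and what is actually resident (assuming a single influxd process):

```
# VmSize includes memory-mapped TSM files; VmRSS is what is actually resident.
grep -E 'VmRSS|VmSize' /proc/$(pgrep -x influxd)/status
```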
Some have used the Go runtime setting GODEBUG=madvdontneed=1. My understanding is that this is less efficient for the OS, but it will mark freed memory as out of InfluxDB's resident set immediately, instead of only when the OS needs it for other purposes.
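If you want to experiment with it, it is just an environment variable on the influxd process; for example, for a manual start (for a systemd install you would set it via an Environment= drop-in instead):

```
# Run influxd with the Go runtime returning freed memory to the OS immediately.
GODEBUG=madvdontneed=1 influxd -config /etc/influxdb/influxdb.conf
```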
Feel free to keep asking questions. I suggest you look at profiling memory if you’re troubleshooting something specific.