Out-of-memory on backup + queries

Purpose
I wanted to go and check backup / restore of a life Influx database.

Environment

influxdb 1.8.1 running in Docker on an Ubuntu 18.04 VM on a Windows 10 Pro Host.
Simultaneously, some queries are launched (read only) through the Python library from the Windows host.

Effect

  • Before start of backup:
    Queries run fine, about 2 seconds each.

  • Now start the backup (from within the Docker container):
    influxd backup -portable ./backup/

  • Queries start failing:
    Sometimes with this message:
    HTTPConnectionPool(host=‘192.168.123.4’, port=8086): Read timed out.
    Sometimes with the additional mentioning of the timeout value:
    HTTPConnectionPool(host=‘192.168.123.4’, port=8086): Read timed out. (read timeout=30).
    From client side, this is similar to my behaviour described in my previous post (that is not conclusively answered): Python query results in "Read timed out"

  • Backup just stops at a certain point.
    Backup directory contents is:
    -rw-r–r-- 1 root root 1296 Sep 16 08:56 20200916T085634Z.meta
    -rw------- 1 root root 14539886 Sep 16 08:56 20200916T085634Z.s3.tar.gz
    -rw------- 1 root root 409930032 Sep 16 08:57 20200916T085634Z.s4.tar.gz
    -rw------- 1 root root 380261790 Sep 16 08:57 20200916T085634Z.s5.tar.gz
    -rw-r–r-- 1 root root 259618302 Sep 16 09:04 dbMeas.autogen.00029.00.pending

  • Storage availability.
    The / has increased with 1GB which matches the size of the backup directory.
    Plenty of storage capacity remaining.

  • Memory:
    System has 16GB.
    Normal operations: about 4GB in use
    During backup + query: 16GB in use

  • After the issue:

    • Retry the queries: they still fail.
      They make the memory jump from 4GB to 16GB for a single query (in 92 seconds)
      The query then fails (with the timeout shown above)
      CPU goes to 100%
    • Kill the queries (Ctrl-C)
      Influx remains on high CPU and high memory for > 5 minutes.
      Then drops to normal levels (because it crashes and then automatically restarts).
    • Relaunch of the Linux host and restart of the InfluxDB docker causes:
      Influx to restart “normally”
      Once a query is launched, again high CPU/Mem and causes InfluxDB to restart
    • Upgrade from 1.8.1 to image 1.8.2 does not improve the issue.
    • Performing the query inside influx (on its command prompt), results in the same issue. It therefore does not seem related to the Python library or the version thereof.

Logging

  • Log at the moment of the backup operation
    2020-09-16T08:56:54.252852Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last() FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T10:00:00Z’"}
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:56:54 +0000] “GET /query?db=dbMeas&q=SELECT+last%28%2A%29+FROM+BBAP+WHERE+time+%3C%3D+%272020-08-20T10%3A00%3A00Z%27 HTTP/1.1” 200 7258 “-” “python-requests/2.24.0” 9175dc74-f7fa-11ea-af10-0242ac140003 2937973
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:56:58 +0000] “POST /write?db=dbMeas&precision=u HTTP/1.1” 204 0 “-” “python-requests/2.24.0” 93cb39e8-f7fa-11ea-af11-0242ac140003 15239
    2020-09-16T08:56:58.186329Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: "SELECT last(
    ) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T11:00:00Z’”}
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:56:58 +0000] “GET /query?db=dbMeas&q=SELECT+last%28%2A%29+FROM+BBAP+WHERE+time+%3C%3D+%272020-08-20T11%3A00%3A00Z%27 HTTP/1.1” 200 7256 “-” “python-requests/2.24.0” 93ce0f9e-f7fa-11ea-af12-0242ac140003 2999755
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:57:02 +0000] “POST /write?db=dbMeas&precision=u HTTP/1.1” 204 0 “-” “python-requests/2.24.0” 962ab686-f7fa-11ea-af13-0242ac140003 13399
    2020-09-16T08:57:02.165499Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last() FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T12:00:00Z’"}
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:57:02 +0000] “GET /query?db=dbMeas&q=SELECT+last%28%2A%29+FROM+BBAP+WHERE+time+%3C%3D+%272020-08-20T12%3A00%3A00Z%27 HTTP/1.1” 200 7258 “-” “python-requests/2.24.0” 962d3ca0-f7fa-11ea-af14-0242ac140003 3040121
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:57:06 +0000] “POST /write?db=dbMeas&precision=u HTTP/1.1” 204 0 “-” “python-requests/2.24.0” 98960130-f7fa-11ea-af15-0242ac140003 16187
    2020-09-16T08:57:06.227117Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: "SELECT last(
    ) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T13:00:00Z’”}
    2020-09-16T08:57:08.000959Z info Cache snapshot (start) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxYd0000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “start”}
    2020-09-16T08:57:08.150366Z info Snapshot for path written {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxYd0000”, “op_name”: “tsm1_cache_snapshot”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/5”, “duration”: “149.425ms”}
    2020-09-16T08:57:08.150769Z info Cache snapshot (end) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxYd0000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “end”, “op_elapsed”: “149.826ms”}
    2020-09-16T08:57:08.151126Z info Cache snapshot (start) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxZDl000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “start”}
    2020-09-16T08:57:08.151481Z info Snapshot for path written {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxZDl000”, “op_name”: “tsm1_cache_snapshot”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/5”, “duration”: “0.360ms”}
    2020-09-16T08:57:08.151796Z info Cache snapshot (end) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxZDl000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “end”, “op_elapsed”: “0.673ms”}
    2020-09-16T08:57:08.152694Z info Cache snapshot (start) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxZE0000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “start”}
    2020-09-16T08:57:08.153043Z info Snapshot for path written {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxZE0000”, “op_name”: “tsm1_cache_snapshot”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/5”, “duration”: “0.354ms”}
    2020-09-16T08:57:08.153329Z info Cache snapshot (end) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxZE0000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “end”, “op_elapsed”: “0.638ms”}
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:57:06 +0000] “GET /query?db=dbMeas&q=SELECT+last%28%2A%29+FROM+BBAP+WHERE+time+%3C%3D+%272020-08-20T13%3A00%3A00Z%27 HTTP/1.1” 200 7527 “-” “python-requests/2.24.0” 9898fd9b-f7fa-11ea-af16-0242ac140003 1999587
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:57:09 +0000] “POST /write?db=dbMeas&precision=u HTTP/1.1” 204 0 “-” “python-requests/2.24.0” 9a5d355a-f7fa-11ea-af17-0242ac140003 6505
    2020-09-16T08:57:09.200308Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last() FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T14:00:00Z’"}
    2020-09-16T08:57:40.227054Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: "SELECT last(
    ) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T14:00:00Z’”}
    2020-09-16T08:57:47.285878Z info Cache snapshot (start) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXzx5G000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “start”}
    2020-09-16T08:57:50.199264Z info Snapshot for path written {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXzx5G000”, “op_name”: “tsm1_cache_snapshot”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/29”, “duration”: “2913.408ms”}
    2020-09-16T08:57:50.199706Z info Cache snapshot (end) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXzx5G000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “end”, “op_elapsed”: “2913.848ms”}
    2020-09-16T08:57:50.200622Z info Cache snapshot (start) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHX~7U0000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “start”}
    2020-09-16T08:57:50.200997Z info Snapshot for path written {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHX~7U0000”, “op_name”: “tsm1_cache_snapshot”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/29”, “duration”: “0.383ms”}
    2020-09-16T08:57:50.201319Z info Cache snapshot (end) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHX~7U0000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “end”, “op_elapsed”: “0.701ms”}
    2020-09-16T08:57:50.201668Z info Cache snapshot (start) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHX~7UG000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “start”}
    2020-09-16T08:57:50.201993Z info Snapshot for path written {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHX~7UG000”, “op_name”: “tsm1_cache_snapshot”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/29”, “duration”: “0.329ms”}
    2020-09-16T08:57:50.202309Z info Cache snapshot (end) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHX~7UG000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “end”, “op_elapsed”: “0.643ms”}
    2020-09-16T08:58:11.649056Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last() FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T14:00:00Z’"}
    2020-09-16T08:58:44.167503Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: "SELECT last(
    ) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T15:00:00Z’”}
    2020-09-16T08:59:17.649781Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last() FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T15:00:00Z’"}
    2020-09-16T08:59:53.787673Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: "SELECT last(
    ) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T15:00:00Z’”}
    2020-09-16T09:00:29.261036Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last() FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T16:00:00Z’"}
    2020-09-16T09:01:23.629055Z info failed to store statistics {“log_id”: “0OpEd4OW000”, “service”: “monitor”, “error”: “timeout”}
    2020-09-16T09:01:28.717779Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: "SELECT last(
    ) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T16:00:00Z’”}
    2020-09-16T09:02:05.021257Z info failed to store statistics {“log_id”: “0OpEd4OW000”, “service”: “monitor”, “error”: “timeout”}
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:09:01:10 +0000] “GET /query?db=dbMeas&q=SELECT+last%28%2A%29+FROM+BBAP+WHERE+time+%3C%3D+%272020-08-20T16%3A00%3A00Z%27 HTTP/1.1” 200 60 “-” “python-requests/2.24.0” 2a633b2a-f7fb-11ea-af1f-0242ac140003 94571698
    2020-09-16T09:02:57.746893Z info failed to store statistics {“log_id”: “0OpEd4OW000”, “service”: “monitor”, “error”: “timeout”}
    2020-09-16T09:03:35.089833Z info failed to store statistics {“log_id”: “0OpEd4OW000”, “service”: “monitor”, “error”: “timeout”}
    2020-09-16T09:03:55.958611Z info failed to store statistics {“log_id”: “0OpEd4OW000”, “service”: “monitor”, “error”: “timeout”}
    fatal error: runtime: out of memory
    runtime stack:
    runtime.throw(0x166acb8, 0x16)
    /usr/local/go/src/runtime/panic.go:774 +0x72
    runtime.sysMap(0xc510000000, 0x4000000, 0x35aec78)
    /usr/local/go/src/runtime/mem_linux.go:169 +0xc5
    runtime.(*mheap).sysAlloc(0x3595c80, 0x756000, 0xdc000, 0x0)
    /usr/local/go/src/runtime/malloc.go:701 +0x1cd
    (… cut a bit to stay in post limit…)
    /usr/local/go/src/runtime/asm_amd64.s:370 +0x66
    runtime.mstart()
    /usr/local/go/src/runtime/proc.go:1146

    goroutine 14069512 [running]:
    runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc004e74ed0 sp=0xc004e74ec8 pc=0x45ee90
    runtime.mallocgc(0x755160, 0x1570e40, 0x13e7401, 0xc00cb21a10)
    /usr/local/go/src/runtime/malloc.go:1032 +0x895 fp=0xc004e74f70 sp=0xc004e74ed0 pc=0x40e405
    runtime.makeslice(0x1570e40, 0x0, 0x75516, 0xc2cf698000)
    /usr/local/go/src/runtime/slice.go:49 +0x6c fp=0xc004e74fa0 sp=0xc004e74f70 pc=0x44789c
    github.com/influxdata/influxdb/tsdb/engine/tsm1.FloatValues.Merge(0xc00a29c000, 0x3e8, 0x3e8, 0xc2cf698000, 0x7512e, 0x7512e, 0x3e8, 0x3e8, 0x3e8)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/encoding.gen.go:438 +0x15f fp=0xc004e75030 sp=0xc004e74fa0 pc=0x120dcbf
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*KeyCursor).ReadFloatBlock(0xc00adfd320, 0xc00cf311f0, 0x0, 0x123c295, 0xc0001fea90, 0xc06afeff50, 0xc0001fea90)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/file_store.gen.go:173 +0xba0 fp=0xc004e75150 sp=0xc004e75030 pc=0x12354a0
    github.com/influxdata/influxdb/tsdb/engine/tsm1.newFloatDescendingCursor(0x162cfeb0133ac000, 0x0, 0x0, 0x0, 0xc00adfd320, 0xc0001fea90)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/iterator.gen.go:515 +0x107 fp=0xc004e751c0 sp=0xc004e75150 pc=0x12492b7
    github.com/influxdata/influxdb/tsdb/engine/tsm1.newFloatCursor(0x162cfeb0133ac000, 0x2549000, 0x0, 0x0, 0x0, 0xc00adfd320, 0x162cfeb0133ac000, 0xc00cf31000)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/iterator.gen.go:372 +0xb6 fp=0xc004e75200 sp=0xc004e751c0 pc=0x1248b36
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).buildFloatCursor(0xc00072d180, 0x25490a0, 0xc06afeff50, 0xc00e23c700, 0x4, 0xc0003b4b40, 0x14, 0xc0088c57e0, 0x15, 0x253e320, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.gen.go:20 +0x26b fp=0xc004e75490 sp=0xc004e75200 pc=0x121a7bb
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).buildCursor(0xc00072d180, 0x25490a0, 0xc06afeff50, 0xc00e23c700, 0x4, 0xc0003b4b40, 0x14, 0xc002312ed0, 0x1, 0x1, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2963 +0xa7d fp=0xc004e75690 sp=0xc004e75490 pc=0x123252d
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createVarRefSeriesIterator(0xc00072d180, 0x25490a0, 0xc06afeff50, 0xc0041f18e0, 0xc00e23c700, 0x4, 0xc0003b4b40, 0x14, 0xc00cf31130, 0x0, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2736 +0x1980 fp=0xc004e75a78 sp=0xc004e75690 pc=0x1231970
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createTagSetGroupIterators(0xc00072d180, 0x25490a0, 0xc06afeff50, 0xc0041f18e0, 0xc00e23c700, 0x4, 0xc0033fc550, 0x1, 0x1, 0xc00cf31130, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2691 +0x1af fp=0xc004e75c98 sp=0xc004e75a78 pc=0x122fbbf
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createTagSetIterators.func1(0xc00ba630a0, 0xc00072d180, 0x25490a0, 0xc06afeff50, 0xc0041f18e0, 0xc00e23c700, 0x4, 0xc02e260de0, 0x1, 0x1, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2651 +0x17f fp=0xc004e75f78 sp=0xc004e75c98 pc=0x127a13f
    runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc004e75f80 sp=0xc004e75f78 pc=0x460f61
    created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createTagSetIterators
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2649 +0x399

  • Logging after restart, just launching query, no backup running
    2020-09-16T10:58:59.935121Z info InfluxDB starting {“log_id”: “0PHdvpsG000”, “version”: “1.8.1”, “branch”: “1.8”, “commit”: “af0237819ab9c5997c1c0144862dc762b9d8fc25”}
    2020-09-16T10:58:59.935291Z info Go runtime {“log_id”: “0PHdvpsG000”, “version”: “go1.13.8”, “maxprocs”: 1}
    2020-09-16T10:59:00.035977Z info Using data dir {“log_id”: “0PHdvpsG000”, “service”: “store”, “path”: “/var/lib/influxdb/data”}
    2020-09-16T10:59:00.036008Z info Compaction settings {“log_id”: “0PHdvpsG000”, “service”: “store”, “max_concurrent_compactions”: 1, “throughput_bytes_per_second”: 50331648, “throughput_bytes_per_second_burst”: 50331648}
    2020-09-16T10:59:00.036021Z info Open store (start) {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “op_event”: “start”}
    2020-09-16T10:59:00.041432Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/22/000000004-000000002.tsm”, “id”: 0, “duration”: “0.185ms”}
    2020-09-16T10:59:00.086220Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/22”, “duration”: “49.327ms”}
    2020-09-16T10:59:00.087278Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/23/000000003-000000002.tsm”, “id”: 0, “duration”: “0.159ms”}
    2020-09-16T10:59:00.114756Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/23”, “duration”: “27.953ms”}
    2020-09-16T10:59:00.115762Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/24/000000004-000000002.tsm”, “id”: 0, “duration”: “0.135ms”}
    2020-09-16T10:59:00.135778Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/24”, “duration”: “20.472ms”}
    2020-09-16T10:59:00.136771Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/25/000000001-000000001.tsm”, “id”: 0, “duration”: “0.111ms”}
    2020-09-16T10:59:00.156026Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/25”, “duration”: “19.678ms”}
    2020-09-16T10:59:00.157023Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/26/000000002-000000002.tsm”, “id”: 0, “duration”: “0.124ms”}
    2020-09-16T10:59:00.175908Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/26”, “duration”: “19.329ms”}
    2020-09-16T10:59:00.176966Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/27/000000003-000000002.tsm”, “id”: 0, “duration”: “0.131ms”}
    2020-09-16T10:59:00.196604Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/27”, “duration”: “20.103ms”}
    2020-09-16T10:59:00.197623Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/28/000000004-000000002.tsm”, “id”: 0, “duration”: “0.139ms”}
    2020-09-16T10:59:00.216059Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/28”, “duration”: “18.895ms”}
    2020-09-16T10:59:00.217132Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/31/000000001-000000001.tsm”, “id”: 0, “duration”: “0.105ms”}
    2020-09-16T10:59:00.217700Z info Reading file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “cacheloader”, “path”: “/var/lib/influxdb/wal/_internal/monitor/31/_00005.wal”, “size”: 10500255}
    2020-09-16T10:59:00.827089Z info Reading file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “cacheloader”, “path”: “/var/lib/influxdb/wal/_internal/monitor/31/_00006.wal”, “size”: 10487530}
    2020-09-16T10:59:01.423870Z info Reading file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “cacheloader”, “path”: “/var/lib/influxdb/wal/_internal/monitor/31/_00007.wal”, “size”: 10488877}
    2020-09-16T10:59:02.145961Z info Reading file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “cacheloader”, “path”: “/var/lib/influxdb/wal/_internal/monitor/31/_00008.wal”, “size”: 1135272}
    2020-09-16T10:59:02.303970Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/31”, “duration”: “2087.289ms”}
    2020-09-16T10:59:02.305407Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/29/000000128-000000001.tsm”, “id”: 1, “duration”: “0.100ms”}
    2020-09-16T10:59:02.306038Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/29/000000127-000000002.tsm”, “id”: 0, “duration”: “0.232ms”}
    2020-09-16T10:59:02.319848Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/29”, “duration”: “15.269ms”}
    2020-09-16T10:59:02.320782Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/3/000000005-000000001.tsm”, “id”: 1, “duration”: “0.064ms”}
    2020-09-16T10:59:02.321268Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/3/000000004-000000002.tsm”, “id”: 0, “duration”: “0.092ms”}
    2020-09-16T10:59:02.339783Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/3”, “duration”: “19.397ms”}
    2020-09-16T10:59:02.340906Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/30/000000032-000000003.tsm”, “id”: 0, “duration”: “0.130ms”}
    2020-09-16T10:59:02.359422Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/30”, “duration”: “19.102ms”}
    2020-09-16T10:59:02.360845Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/4/000000128-000000001.tsm”, “id”: 1, “duration”: “0.100ms”}
    2020-09-16T10:59:02.361508Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/4/000000127-000000002.tsm”, “id”: 0, “duration”: “0.230ms”}
    2020-09-16T10:59:02.379655Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/4”, “duration”: “19.690ms”}
    2020-09-16T10:59:02.380972Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/5/000000117-000000001.tsm”, “id”: 1, “duration”: “0.094ms”}
    2020-09-16T10:59:02.381680Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/5/000000116-000000002.tsm”, “id”: 0, “duration”: “0.229ms”}
    2020-09-16T10:59:02.400102Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/5”, “duration”: “19.946ms”}
    2020-09-16T10:59:02.400886Z info Open store (end) {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “op_event”: “end”, “op_elapsed”: “2364.865ms”}
    2020-09-16T10:59:02.401649Z info Opened service {“log_id”: “0PHdvpsG000”, “service”: “subscriber”}
    2020-09-16T10:59:02.402075Z info Starting monitor service {“log_id”: “0PHdvpsG000”, “service”: “monitor”}
    2020-09-16T10:59:02.402521Z info Registered diagnostics client {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “name”: “build”}
    2020-09-16T10:59:02.402904Z info Registered diagnostics client {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “name”: “runtime”}
    2020-09-16T10:59:02.403321Z info Registered diagnostics client {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “name”: “network”}
    2020-09-16T10:59:02.403733Z info Registered diagnostics client {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “name”: “system”}
    2020-09-16T10:59:02.404124Z info Starting precreation service {“log_id”: “0PHdvpsG000”, “service”: “shard-precreation”, “check_interval”: “10m”, “advance_period”: “30m”}
    2020-09-16T10:59:02.404590Z info Starting snapshot service {“log_id”: “0PHdvpsG000”, “service”: “snapshot”}
    2020-09-16T10:59:02.404660Z info Starting continuous query service {“log_id”: “0PHdvpsG000”, “service”: “continuous_querier”}
    2020-09-16T10:59:02.404682Z info Starting HTTP service {“log_id”: “0PHdvpsG000”, “service”: “httpd”, “authentication”: true}
    2020-09-16T10:59:02.404699Z info opened HTTP access log {“log_id”: “0PHdvpsG000”, “service”: “httpd”, “path”: “stderr”}
    2020-09-16T10:59:02.404716Z info Auth is enabled but shared-secret is blank. BearerAuthentication is disabled. {“log_id”: “0PHdvpsG000”, “service”: “httpd”}
    2020-09-16T10:59:02.404924Z info Listening on HTTP {“log_id”: “0PHdvpsG000”, “service”: “httpd”, “addr”: “[::]:8086”, “https”: false}
    2020-09-16T10:59:02.405074Z info Starting retention policy enforcement service {“log_id”: “0PHdvpsG000”, “service”: “retention”, “check_interval”: “30m”}
    2020-09-16T10:59:02.405212Z info Listening for signals {“log_id”: “0PHdvpsG000”}
    2020-09-16T10:59:02.404284Z info Storing statistics {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “db_instance”: “_internal”, “db_rp”: “monitor”, “interval”: “10s”}
    2020-09-16T10:59:02.406025Z info Sending usage statistics to usage.influxdata.com {“log_id”: “0PHdvpsG000”}
    [httpd] 127.0.0.1 - - [16/Sep/2020:10:59:32 +0000] “GET /ping HTTP/1.1” 204 0 “-” “InfluxDBShell/1.8.1” b3103bac-f80b-11ea-8001-0242ac140003 12463
    2020-09-16T10:59:42.046742Z info Executing query {“log_id”: “0PHdvpsG000”, “service”: “query”, “query”: “SELECT last() FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-18T06:00:00Z’"}
    2020-09-16T11:00:12.755068Z info Executing query {“log_id”: “0PHdvpsG000”, “service”: “query”, “query”: "SELECT last(
    ) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-18T06:00:00Z’”}
    2020-09-16T11:00:44.204467Z info Executing query {“log_id”: “0PHdvpsG000”, “service”: “query”, “query”: “SELECT last() FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-18T06:00:00Z’"}
    2020-09-16T11:01:16.295800Z info Executing query {“log_id”: “0PHdvpsG000”, “service”: “query”, “query”: "SELECT last(
    ) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-18T07:00:00Z’”}
    2020-09-16T11:06:51.709863Z info failed to store statistics {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “error”: “timeout”}
    2020-09-16T11:07:40.786842Z info failed to store statistics {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “error”: “timeout”}
    fatal error: runtime: out of memory
    runtime stack:
    runtime.throw(0x166acb8, 0x16)
    /usr/local/go/src/runtime/panic.go:774 +0x72
    runtime.sysMap(0xc5cc000000, 0x4000000, 0x35aec78)
    /usr/local/go/src/runtime/mem_linux.go:169 +0xc5
    runtime.(*mheap).sysAlloc(0x3595c80, 0x125a000, 0x3595c80, 0x7f9253320360)
    /usr/local/go/src/runtime/malloc.go:701 +0x1cd
    (… cut a bit to stay in post limit…)
    runtime.systemstack(0x0)
    /usr/local/go/src/runtime/asm_amd64.s:370 +0x66
    runtime.mstart()
    /usr/local/go/src/runtime/proc.go:1146

    goroutine 4899 [running]:
    runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc04b50eed0 sp=0xc04b50eec8 pc=0x45ee90
    runtime.mallocgc(0x1259f90, 0x1570e40, 0x13e7401, 0xc00164d7a0)
    /usr/local/go/src/runtime/malloc.go:1032 +0x895 fp=0xc04b50ef70 sp=0xc04b50eed0 pc=0x40e405
    runtime.makeslice(0x1570e40, 0x0, 0x1259f9, 0xc1d00b6000)
    /usr/local/go/src/runtime/slice.go:49 +0x6c fp=0xc04b50efa0 sp=0xc04b50ef70 pc=0x44789c
    github.com/influxdata/influxdb/tsdb/engine/tsm1.FloatValues.Merge(0xc004f4e000, 0x3e8, 0x3e8, 0xc1d00b6000, 0x125611, 0x125611, 0x3e8, 0x3e8, 0x3e8)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/encoding.gen.go:438 +0x15f fp=0xc04b50f030 sp=0xc04b50efa0 pc=0x120dcbf
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*KeyCursor).ReadFloatBlock(0xc027c56bd0, 0xc02d766ac0, 0x0, 0x123c295, 0xc001545520, 0xc004f90b10, 0xc001545520)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/file_store.gen.go:173 +0xba0 fp=0xc04b50f150 sp=0xc04b50f030 pc=0x12354a0
    github.com/influxdata/influxdb/tsdb/engine/tsm1.newFloatDescendingCursor(0x162c47556ad7c000, 0x0, 0x0, 0x0, 0xc027c56bd0, 0xc001545520)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/iterator.gen.go:515 +0x107 fp=0xc04b50f1c0 sp=0xc04b50f150 pc=0x12492b7
    github.com/influxdata/influxdb/tsdb/engine/tsm1.newFloatCursor(0x162c47556ad7c000, 0x2549000, 0x0, 0x0, 0x0, 0xc027c56bd0, 0x162c47556ad7c000, 0xc02d766a00)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/iterator.gen.go:372 +0xb6 fp=0xc04b50f200 sp=0xc04b50f1c0 pc=0x1248b36
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).buildFloatCursor(0xc00017f180, 0x25490a0, 0xc004f90b10, 0xc004cefd50, 0x4, 0xc00138f3a0, 0x14, 0xc00160b7e0, 0x14, 0x253e320, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.gen.go:20 +0x26b fp=0xc04b50f490 sp=0xc04b50f200 pc=0x121a7bb
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).buildCursor(0xc00017f180, 0x25490a0, 0xc004f90b10, 0xc004cefd50, 0x4, 0xc00138f3a0, 0x14, 0xc02fe0a1e0, 0x1, 0x1, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2963 +0xa7d fp=0xc04b50f690 sp=0xc04b50f490 pc=0x123252d
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createVarRefSeriesIterator(0xc00017f180, 0x25490a0, 0xc004f90b10, 0xc0050fb3a0, 0xc004cefd50, 0x4, 0xc00138f3a0, 0x14, 0xc02d766a50, 0x0, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2736 +0x1980 fp=0xc04b50fa78 sp=0xc04b50f690 pc=0x1231970
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createTagSetGroupIterators(0xc00017f180, 0x25490a0, 0xc004f90b10, 0xc0050fb3a0, 0xc004cefd50, 0x4, 0xc001cb7e00, 0x1, 0x1, 0xc02d766a50, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2691 +0x1af fp=0xc04b50fc98 sp=0xc04b50fa78 pc=0x122fbbf
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createTagSetIterators.func1(0xc0013cbb20, 0xc00017f180, 0x25490a0, 0xc004f90b10, 0xc0050fb3a0, 0xc004cefd50, 0x4, 0xc00d454300, 0x1, 0x1, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2651 +0x17f fp=0xc04b50ff78 sp=0xc04b50fc98 pc=0x127a13f
    runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc04b50ff80 sp=0xc04b50ff78 pc=0x460f61
    created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createTagSetIterators
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2649 +0x399

Additional info: Seems Ubuntu killed influx since it went out-of-memory.

Sep 16 13:11:22 ubuntu1804ip4 kernel: [ 3538.483705] Out of memory: Kill process 13910 (influxd) score 953 or sacrifice child
Sep 16 13:11:22 ubuntu1804ip4 kernel: [ 3538.484156] Killed process 13910 (influxd) total-vm:30931432kB, anon-rss:18434156kB, file-rss:0kB, shmem-rss:0kB
Sep 16 13:11:22 ubuntu1804ip4 kernel: [ 3539.178199] oom_reaper: reaped process 13910 (influxd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I’ll change some config and report back asap.

As far as the “out-of-memory” is concerned, a switch to tsi1 indexes solves the issue.

Next, I retried the combination of backup and running queries. The queries were barely influenced by the running backup and both still completed fine. No out-of-memory, no timeouts.

Is such a dramatic difference with “tsm” expected? My database only has two measurements, both with 1 series. First measurement has 1300 fields, second about 15. One month of data is about 2GB.

Conclusion: problem solved, root cause not completely understood.

… I was too fast…

It now just took quite a bit longer, but influx got still killed by Ubunto because of an out-of-memory-panic.
The system now just seemed to use additional swap memory before crashing, so it took longer.
Right before its dead, influx had about 22GB in use (of which 6GB in swap)

I’ve seen already a lot of requests concerning how to avoid out-of-memory on influxdb, but never a definitive answer from InfluxData. My desire: Influxd lives with the memory that I assign it and adjusts its behavior accordingly.

Using the TSI (on-disk) version will generally reduce memory pressure as TSI is then memory mapped as needed instead of residing entirely in memory. You have seen this already.

Unfortunately, there is no maximum memory setting. You will need to allocate more resources (memory) to your vm or reduce the complexity of your queries. Generally, the OS manages memory mapped files and will attempt to provide as much virtual space as possible.

Thanks for your answer.

Isn’t this very strange for a production-db?
That means that my customer can ask any question to the db and in some cases the db will just crash.
There is no possibility for me to:

  • either degrade in performance: the query still succeeds, but becomes slow
  • either fail gracefully: the user gets the reply that his query exceeded the available memory and the query is aborted, but all other queries/updates still run fine.

Do I understand this correctly? In that case, I guess we will not be adopting Influx.
Am I overlooking something in this? Maybe the version 2 (currently in beta) has such an option?

Johan